Processing a text file with an XML column in Apache Spark (Scala)

Time: 2017-09-12 02:08:14

Tags: xml scala apache-spark

I have a file like this:

1,<note><from>Messi</from><body>Don't forget me this weekend!</body></note>
2,<note><from>Ronaldo</from><body>Don't forget Laliga</body></note>
3,<note><from>Neymar</from><body>I am the best </body></note>
4,<note><from>Suarez</from><body>Don't forget me this weekend!</body></note>

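The first field is the id and the second is the data. I need to load the file into an RDD, parse the XML strings, extract the fields, and create another RDD like this: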

1,Messi,Don't forget me this weekend!
2,Ronaldo,Don't forget Laliga
3,Neymar,I am the best 
4,Suarez,Don't forget me this weekend!

Since the XML in the real scenario is complex, I would like to use an XML parser. How can I do this?

1 answer:

Answer 0: (score: 2)

You can use Scala's own XML library. Before you can query the data, though, you need to parse the string into an Elem object:

import scala.xml._

val str = "<note><from>Messi</from><body>Don't forget me this weekend!</body></note>"

val xml = XML.loadString(str)
xml: scala.xml.Elem = <note><from>Messi</from><body>Don't forget me this weekend!</body></note>

To extract a single element, use:

xml \\ "note" \\ "from"
res19: scala.xml.NodeSeq = NodeSeq(<from>Messi</from>)

This yields an object of type NodeSeq. To get the string, use:

(xml \\ "note" \\ "from").text
res20: String = Messi
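
The queries above use `\\`, which searches the element itself and all of its descendants. scala.xml also has `\`, which looks only at direct children. A small sketch of the difference; the shortened sample string and the value names are only for illustration:

```scala
import scala.xml.XML

val note = XML.loadString("<note><from>Messi</from><body>Hi</body></note>")

// `\` looks only at direct children of the element.
val direct = (note \ "from").text
// `\\` searches the element itself and all of its descendants.
val deep = (note \\ "from").text
// The root is <note>, so it is not a child of itself: `\` finds nothing here.
val rootAsChild = note \ "note"

println(direct)
println(deep)
println(rootAsChild.isEmpty)
```

For deeply nested XML, `\\` can match more elements than intended, so `\` is often the safer choice when the path to the field is known.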

Coming to your question:

val rdd = sc.parallelize(Array(
(1,"<note><from>Messi</from><body>Don't forget me this weekend!</body></note>"),
(2,"<note><from>Ronaldo</from><body>Don't forget Laliga</body></note>"),
(3,"<note><from>Neymar</from><body>I am the best </body></note>"),
(4,"<note><from>Suarez</from><body>Don't forget me this weekend!</body></note>")
)) 

rdd.map { case (id, xml) =>
    // Parse each XML string once and reuse the result for both fields
    val note = XML.loadString(xml)
    (id, (note \\ "note" \\ "from").text, (note \\ "note" \\ "body").text)
}.collect.foreach(println)

(1,Messi,Don't forget me this weekend!)
(2,Ronaldo,Don't forget Laliga)
(3,Neymar,I am the best )
(4,Suarez,Don't forget me this weekend!)
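
The example above starts from an RDD that already holds (id, xml) pairs. For the raw file in the question, each line first has to be split at the first comma. The following is a minimal sketch, assuming every line has the form `id,<xml>` with an integer id; `parseLine` and the path `notes.txt` are hypothetical names, not part of the original answer:

```scala
import scala.xml.XML

// Hypothetical helper: turns one raw line "id,<note>...</note>" into (id, from, body).
def parseLine(line: String): (Int, String, String) = {
  // Split only at the first comma, so commas inside the XML stay intact.
  val idx = line.indexOf(',')
  val id = line.substring(0, idx).trim.toInt
  // Parse once and reuse the result for both fields.
  val note = XML.loadString(line.substring(idx + 1))
  (id, (note \ "from").text, (note \ "body").text)
}

// Local check without Spark:
val sample = Seq(
  "1,<note><from>Messi</from><body>Don't forget me this weekend!</body></note>",
  "3,<note><from>Neymar</from><body>I am the best </body></note>"
)
val parsed = sample.map(parseLine)
parsed.foreach(println)

// With Spark, assuming a SparkContext `sc` and a file path of your own:
// sc.textFile("notes.txt").map(parseLine)
//   .map { case (id, from, body) => s"$id,$from,$body" }
```

The sketch does not handle malformed lines; a line without a comma or with invalid XML would throw, so real data may need a `Try` around the parsing step.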