I have a file like this:
1,<note><from>Messi</from><body>Don't forget me this weekend!</body></note>
2,<note><from>Ronaldo</from><body>Don't forget Laliga</body></note>
3,<note><from>Neymar</from><body>I am the best </body></note>
4,<note><from>Suarez</from><body>Don't forget me this weekend!</body></note>
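The first field is the id and the second field is the data. I need to load it into an RDD, parse the XML string, extract the fields, and create another RDD that looks like this: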
1,Messi,Don't forget me this weekend!
2,Ronaldo,Don't forget Laliga
3,Neymar,I am the best
4,Suarez,Don't forget me this weekend!
Since the XML in the real scenario is complex, I want to use an XML parser. How can I do this?
Answer 0 (score: 2)
You can use Scala's own XML library. Before doing that, however, you need to parse the string into an Elem object:
import scala.xml._
val str = "<note><from>Messi</from><body>Don't forget me this weekend!</body></note>"
val xml = XML.loadString(str)
xml: scala.xml.Elem = <note><from>Messi</from><body>Don't forget me this weekend!</body></note>
To extract a single element, use:
xml \\ "note" \\ "from"
res19: scala.xml.NodeSeq = NodeSeq(<from>Messi</from>)
This yields an object of type NodeSeq. To get the string, use:
(xml \\ "note" \\ "from").text
res20: String = Messi
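Since the real XML is described as complex, it may help to see how the two selectors differ: `\` matches direct children only, while `\\` matches the element itself and all descendants. The following is a small sketch; the nested `<meta>` element is a made-up example and not part of the question's data.

```scala
import scala.xml.XML

// Hypothetical document with a second, nested <from> element
val doc = XML.loadString(
  "<note><from>Messi</from><meta><from>proxy</from></meta></note>")

// `\` looks only at the direct children of <note>
val direct = (doc \ "from").text

// `\\` searches the whole tree, so it also finds the nested <from>
val all = (doc \\ "from").map(_.text).toList

println(direct) // Messi
println(all)    // List(Messi, proxy)
```

With deeply nested documents, `\` is usually the safer choice when you know the exact path, because `\\` can pick up same-named elements from unrelated branches.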
Now, coming to your question:
val rdd = sc.parallelize(Array(
(1,"<note><from>Messi</from><body>Don't forget me this weekend!</body></note>"),
(2,"<note><from>Ronaldo</from><body>Don't forget Laliga</body></note>"),
(3,"<note><from>Neymar</from><body>I am the best </body></note>"),
(4,"<note><from>Suarez</from><body>Don't forget me this weekend!</body></note>")
))
rdd.map { case (id, xml) =>
  // Parse each record once and reuse the resulting Elem for both fields
  val note = XML.loadString(xml)
  (id, (note \\ "from").text, (note \\ "body").text)
}.collect.foreach(println)
(1,Messi,Don't forget me this weekend!)
(2,Ronaldo,Don't forget Laliga)
(3,Neymar,I am the best )
(4,Suarez,Don't forget me this weekend!)
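The question starts from a text file of `id,xml` lines rather than an in-memory array, so here is a minimal sketch of a per-line parser that could be passed to `map`. The helper name `parseLine` and the file name `notes.txt` are assumptions for illustration. The sketch is written for the REPL or spark-shell, and it assumes every line is well formed (a line without a comma would throw a MatchError, and malformed XML would throw a parse exception).

```scala
import scala.xml.XML

// Hypothetical helper: split on the FIRST comma only (limit = 2),
// because the XML payload may itself contain commas.
def parseLine(line: String): (Int, String, String) = {
  val Array(id, payload) = line.split(",", 2)
  val note = XML.loadString(payload)
  // trim the body so "I am the best " matches the desired output
  (id.trim.toInt, (note \ "from").text, (note \ "body").text.trim)
}

// Two sample lines from the question, used here instead of a real file
val lines = Seq(
  "1,<note><from>Messi</from><body>Don't forget me this weekend!</body></note>",
  "3,<note><from>Neymar</from><body>I am the best </body></note>"
)

val parsed = lines.map(parseLine)
parsed.map { case (id, from, body) => s"$id,$from,$body" }.foreach(println)

// With Spark, assuming a file named notes.txt, the same function would be used as:
//   val result = sc.textFile("notes.txt").map(parseLine)
```

Splitting with a limit of 2 keeps the whole XML string intact as the second element, which a plain `split(",")` would not guarantee.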