How do I access XML attributes using a DataFrame and the com.databricks.spark.xml format?

Asked: 2017-02-21 13:09:56

Tags: xml scala apache-spark dataframe

I have an XML file that looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<paml version="2.0" xmlns="paml20.xsd">
  <kmData type="partial">
    <header>
      <log dateTime="2016-11-10T07:01:37" action="created">partial used</log>
    </header>
    <Object class="SSC" version="0.3" dName="p2345" id="600">
      <list name="sscOptions">
        <p>0</p>
        <p>1</p>
        <p>2</p>
        <p>3</p>
        <p>4</p>
      </list>
      <p name="AAA">2</p>
      <p name="BBB">3</p>
      <p name="CCC">NNN</p>
      <p name="DDD">26</p>
      <p name="EEE">30</p>
      <p name="FFF">30</p>
      <p name="GGG">80</p>
      <p name="HHH">20</p>
      <p name="III">100</p>
    </Object>
    <Object class="PLUS2" version="0.5" dName="p2346" id="700">
      <p name="AAA">5</p>
      <p name="BBB">1</p>
      <p name="CCC">0</p>
      <p name="DDD">0</p>
      <p name="EEE">0</p>
      <p name="FFF">0</p>
      <list name="PLUS2Out">
        <p>0</p>
        <p>0</p>
        <p>0</p>
        <p>0</p>
        <p>0</p>
        <p>0</p>
      </list>
      <p name="GGG">8</p>
    </Object>
  </kmData>
</paml>

I want to extract:

dateTime, class, version, dName, id, AAA, CCC from the Object with class="SSC", and
dateTime, class, version, dName, id, AAA, BBB, CCC from the Object with class="PLUS2",

and write the results to a file.

I have tried the following code:

package Dataframeparsing

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

object Dataframeparse {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("Parse XML Data").setMaster("local[*]"))
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    val df = sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "Object")
      .load("D://userdata//sam//Desktop//abc.xml")
    println("xml file read done")

    val Data1 = df.filter("class = 'SSC'")
    val D1store = Data1.select("AAA", "CCC")
    D1store.show()
    D1store.write.option("header", "true").csv("file:///C:/out1.csv")

    val Data2 = df.filter("class = 'PLUS2'")
    val D2store = Data2.select("AAA", "BBB", "CCC")
    D2store.show()
    D2store.write.option("header", "true").csv("file:///C:/out2.csv")
  }
}

When I run the code above, I get the following error:


17/02/21 18:16:12 INFO DAGScheduler: ResultStage 0 (treeAggregate at InferSchema.scala:60) finished in 14.905 s
17/02/21 18:16:12 INFO DAGScheduler: Job 0 finished: treeAggregate at InferSchema.scala:60, took 15.088996 s
xml file read done
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'AAA' given input columns p, class, dName, defaults, list, version, id;
        at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)

I want the final output to look like this:

class=SSC

2016-11-10T07:01:37,SSC,0.3,p2345,600,2,NNN

class=PLUS2

2016-11-10T07:01:37,PLUS2,0.5,p2346,700,5,1,0
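For context on the AnalysisException: with `rowTag` set to `Object`, spark-xml infers one column per distinct child element and attribute, so the repeated `<p name="AAA">2</p>` elements arrive as a single array-of-structs column `p` (as the error's "input columns p, class, dName, ... list" shows), not as top-level columns named `AAA`, `BBB`, etc. A minimal sketch of one way to reach those values is below; it is untested, the struct field names (`_name`, `_VALUE`) assume spark-xml's default `attributePrefix` and `valueTag` settings, and the exact names should be checked against what `df.printSchema()` reports:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.explode

object AttributeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("xml-attrs").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "Object")
      .load("D://userdata//sam//Desktop//abc.xml")

    // Inspect the inferred schema first; the field names below
    // (_name, _VALUE) are assumptions based on spark-xml defaults.
    df.printSchema()

    // Explode the array of <p> structs, then filter on the name attribute
    // to pull out individual parameters such as AAA and CCC.
    val params = df
      .select($"class", explode($"p").as("param"))
      .select($"class", $"param._name".as("name"), $"param._VALUE".as("value"))

    params
      .filter($"class" === "SSC" && $"name".isin("AAA", "CCC"))
      .show()
  }
}
```

Note also that `dateTime` lives on the `<log>` element inside `<header>`, which is outside the `Object` row tag; getting it into the same output would need a second read with a broader `rowTag` (such as `kmData`) and a join, since a single `rowTag = "Object"` read never sees it.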

0 Answers:

No answers yet.