I have an XML file, shown below:
<?xml version="1.0" encoding="UTF-8"?>
<paml version="2.0" xmlns="paml20.xsd">
<kmData type="partial">
<header>
<log dateTime="2016-11-10T07:01:37" action="created">partial used</log>
</header>
<Object class="SSC" version="0.3" dName="p2345" id="600">
<list name="sscOptions">
<p>0</p>
<p>1</p>
<p>2</p>
<p>3</p>
<p>4</p>
</list>
<p name="AAA">2</p>
<p name="BBB">3</p>
<p name="CCC">NNN</p>
<p name="DDD">26</p>
<p name="EEE">30</p>
<p name="FFF">30</p>
<p name="GGG">80</p>
<p name="HHH">20</p>
<p name="III">100</p>
</Object>
<Object class="PLUS2" version="0.5" dName="p2346" id="700">
<p name="AAA">5</p>
<p name="BBB">1</p>
<p name="CCC">0</p>
<p name="DDD">0</p>
<p name="EEE">0</p>
<p name="FFF">0</p>
<list name="PLUS2Out">
<p>0</p>
<p>0</p>
<p>0</p>
<p>0</p>
<p>0</p>
<p>0</p>
</list>
<p name="GGG">8</p>
</Object>
</kmData>
</paml>
I want to extract:
dateTime, class, version, dName, id, AAA, CCC from the Object with class="SSC", and
dateTime, class, version, dName, id, AAA, BBB, CCC from the Object with class="PLUS2",
and write the results to a file.
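For reference, those fields sit in the document like this: dateTime is an attribute on the header's log element, class/version/dName/id are attributes on each Object, and AAA/BBB/CCC are the text of the p children keyed by their name attribute. A minimal plain-Scala sketch of the extraction with scala.xml (the file path is a placeholder):

import scala.xml.XML

val root = XML.loadFile("abc.xml") // placeholder path
// dateTime lives on the <log> element inside <header>
val dateTime = (root \ "kmData" \ "header" \ "log" \ "@dateTime").text

for (obj <- root \ "kmData" \ "Object") {
  val cls = (obj \ "@class").text
  // direct <p name="...">value</p> children, keyed by their name attribute
  val params = (obj \ "p").map(p => ((p \ "@name").text, p.text)).toMap
  val wanted = if (cls == "SSC") Seq("AAA", "CCC") else Seq("AAA", "BBB", "CCC")
  val row = Seq(dateTime, cls, (obj \ "@version").text,
                (obj \ "@dName").text, (obj \ "@id").text) ++
            wanted.flatMap(params.get)
  println(row.mkString(","))
}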
With Spark, I have tried the following code:
package Dataframeparsing
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import scala.xml.XML
object Dataframeparse {
def main(args: Array[String]) {
val sc = new SparkContext(new SparkConf().setAppName("Parse XML Data").setMaster("local[*]"))
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read
.format("com.databricks.spark.xml")
.option("rowTag", "Object")
.load("D://userdata//sam//Desktop//abc.xml")
print("xml file read done")
val Data1 = df.filter("class = 'SSC'")
// show() returns Unit, so keep the DataFrame in a val and call
// write on the DataFrame itself, not on show()'s result
val D1store = Data1.select("AAA", "CCC")
D1store.show()
D1store.write.option("header", "true").csv("file:///C:/out1.csv")
val Data2 = df.filter("class = 'PLUS2'")
val D2store = Data2.select("AAA", "BBB", "CCC")
D2store.show()
D2store.write.option("header", "true").csv("file:///C:/out2.csv")
}
}
When I run the code above, I get the following error:
17/02/21 18:16:12 INFO DAGScheduler: ResultStage 0 (treeAggregate at InferSchema.scala:60) finished in 14.905 s
17/02/21 18:16:12 INFO DAGScheduler: Job 0 finished: treeAggregate at InferSchema.scala:60, took 15.088996 s
xml file read done
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'AAA' given input columns p, class, dName, defaults, list, version, id;
    at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
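The error itself points at the cause: with rowTag set to Object, spark-xml maps each repeated <p name="...">value</p> element into the p array column, so AAA is data inside that array rather than a column of its own. One way out is to explode the array and pivot the name/value pairs into columns. A sketch under that assumption (the struct field names for the name attribute and the element text vary with the spark-xml version and its attributePrefix/valueTag options, so confirm them with df.printSchema() first; pivot needs Spark 1.6+):

import org.apache.spark.sql.functions.{col, explode, first}

// Expect p to be array<struct<...>> holding the name attribute and the text.
df.printSchema()

// Flatten each <p> entry into (name, value) rows beside the Object attributes.
// "_name"/"_VALUE" are spark-xml defaults; the error above shows unprefixed
// column names, so adjust these to whatever printSchema actually reports.
val flat = df
  .withColumn("param", explode(col("p")))
  .select(col("class"), col("version"), col("dName"), col("id"),
          col("param._name").as("name"), col("param._VALUE").as("value"))

// Pivot the parameter names back into real columns.
val wide = flat
  .groupBy("class", "version", "dName", "id")
  .pivot("name")
  .agg(first("value"))

wide.filter(col("class") === "SSC")
  .select("class", "version", "dName", "id", "AAA", "CCC").show()
wide.filter(col("class") === "PLUS2")
  .select("class", "version", "dName", "id", "AAA", "BBB", "CCC").show()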
I want the final output to look like this:
class=SSC
2016-11-10T07:01:37,SSC,0.3,p2345,600,2,NNN
class=PLUS2
2016-11-10T07:01:37,PLUS2,0.5,p2346,700,5,1,0
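Continuing the sketch above: rowTag "Object" drops the <header>, so dateTime has to come from a separate read; rowTag "log" and the plain dateTime column name are assumptions here (with an attribute prefix configured it would be e.g. _dateTime). Note also that DataFrameWriter.csv only exists from Spark 2.0 on; on 1.x the spark-csv package's format("com.databricks.spark.csv") is the equivalent:

import org.apache.spark.sql.functions.{col, lit}

// Second read to recover the header's dateTime attribute.
val logDf = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "log")
  .load("D://userdata//sam//Desktop//abc.xml")
val dateTime = logDf.select("dateTime").first().getString(0)

// Prepend it as a literal column and write the SSC slice; PLUS2 is analogous.
wide.filter(col("class") === "SSC")
  .select(lit(dateTime).as("dateTime"), col("class"), col("version"),
          col("dName"), col("id"), col("AAA"), col("CCC"))
  .coalesce(1) // a single output file
  .write.option("header", "true")
  .format("com.databricks.spark.csv")
  .save("file:///C:/out1")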