I have loaded data from an XML file:
val xmlContent = spark.sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "GROUP.NOTES")
  .load("/datalake/other/decomlake/spark-xml-poc/Sample.xml")
The schema of xmlContent is as follows:
xmlContent.printSchema
root
|-- GROUP.NOTES-ROW: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- XML_COLUMN_1_TEXT-MV: struct (nullable = true)
| | | |-- XML_COLUMN_1_TEXT: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- _VALUE: string (nullable = true)
| | | | | |-- _val: long (nullable = true)
| | |-- XML_COLUMN_2_TEXT-MV: struct (nullable = true)
| | | |-- XML_COLUMN_2_TEXT: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- _VALUE: string (nullable = true)
| | | | | |-- _val: long (nullable = true)
| | |-- XML_COLUMN_3_TEXT-MV: struct (nullable = true)
| | | |-- XML_COLUMN_3_TEXT: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- _VALUE: string (nullable = true)
| | | | | |-- _val: long (nullable = true)
| | |-- XML_FLD001: string (nullable = true)
| | |-- XML_FLD002: string (nullable = true)
| | |-- XML_FLD004: string (nullable = true)
| | |-- XML_FLD006: string (nullable = true)
| | |-- XML_FLD007: string (nullable = true)
| | |-- XML_ID: string (nullable = true)
| | |-- _confidential: string (nullable = true)
|-- _account: string (nullable = true)
|-- _area: string (nullable = true)
|-- _exbatch: string (nullable = true)
|-- _filename: string (nullable = true)
|-- _mahptablename: string (nullable = true)
|-- _subaccount: string (nullable = true)
I am able to get the values of the columns _account, _area, _exbatch, _filename, _mahptablename, and _subaccount, but I cannot get the value of the column GROUP.NOTES-ROW because of the following error:
val groupNoteDf=xmlContent.select("GROUP.NOTES-ROW").show()
org.apache.spark.sql.AnalysisException: cannot resolve '`GROUP.NOTES-ROW`' given input columns: [_exbatch, _account, _area, GROUP.NOTES-ROW, _subaccount, _mahptablename, _filename];;
'Project ['GROUP.NOTES-ROW]
+- Relation[GROUP.NOTES-ROW#0,_account#1,_area#2,_exbatch#3,_filename#4,_mahptablename#5,_subaccount#6] XmlRelation(<function0>,Some(/datalake/other/decomlake/spark-xml-poc/Sample.xml),Map(rowtag -> GROUP.NOTES, path -> /datalake/other/decomlake/spark-xml-poc/Sample.xml),null)
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:89)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:86)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:268)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:268)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:279)
at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:289)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:293)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:293)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$6.apply(QueryPlan.scala:298)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:298)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:268)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:86)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:79)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:79)
at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
I am using the Databricks API to parse and load the XML file. Thanks in advance to anyone who can help me find a workaround.
Answer 0 (score: 0):
You need backticks (`) to escape the dot and hyphen in your column name; without them, Spark parses the dot in GROUP.NOTES-ROW as a struct field accessor, which is why the column cannot be resolved even though it appears in the input columns.
xmlContent.select("`GROUP.NOTES-ROW`").show()
Also, show() returns Unit, so do not assign its result to a variable. You can use the statement above to view your DataFrame directly. If you want to create a new DataFrame by assigning it to a variable, do not call show().
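Putting both points together, a minimal sketch (reusing the xmlContent DataFrame from the question):
// Backticks make Spark treat the dot and hyphen as part of the column name
// instead of parsing the dot as struct field access.
val groupNoteDf = xmlContent.select("`GROUP.NOTES-ROW`")
// show() returns Unit, so call it on the DataFrame itself rather than
// assigning its result to a variable.
groupNoteDf.show(false)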