Unable to get column values from a Spark DataFrame

Asked: 2018-07-31 06:25:23

Tags: scala apache-spark apache-spark-sql

I have loaded data from an XML file:

 val xmlContent=spark.sqlContext.read.format("com.databricks.spark.xml").option("rowTag","GROUP.NOTES").load("/datalake/other/decomlake/spark-xml-poc/Sample.xml")

The schema of xmlContent is as follows:
 xmlContent.printSchema
root
 |-- GROUP.NOTES-ROW: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- XML_COLUMN_1_TEXT-MV: struct (nullable = true)
 |    |    |    |-- XML_COLUMN_1_TEXT: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- _VALUE: string (nullable = true)
 |    |    |    |    |    |-- _val: long (nullable = true)
 |    |    |-- XML_COLUMN_2_TEXT-MV: struct (nullable = true)
 |    |    |    |-- XML_COLUMN_2_TEXT: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- _VALUE: string (nullable = true)
 |    |    |    |    |    |-- _val: long (nullable = true)
 |    |    |-- XML_COLUMN_3_TEXT-MV: struct (nullable = true)
 |    |    |    |-- XML_COLUMN_3_TEXT: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- _VALUE: string (nullable = true)
 |    |    |    |    |    |-- _val: long (nullable = true)
 |    |    |-- XML_FLD001: string (nullable = true)
 |    |    |-- XML_FLD002: string (nullable = true)
 |    |    |-- XML_FLD004: string (nullable = true)
 |    |    |-- XML_FLD006: string (nullable = true)
 |    |    |-- XML_FLD007: string (nullable = true)
 |    |    |-- XML_ID: string (nullable = true)
 |    |    |-- _confidential: string (nullable = true)
 |-- _account: string (nullable = true)
 |-- _area: string (nullable = true)
 |-- _exbatch: string (nullable = true)
 |-- _filename: string (nullable = true)
 |-- _mahptablename: string (nullable = true)
 |-- _subaccount: string (nullable = true)

I am able to get the values of the columns: _account, _area, _exbatch, _filename, _mahptablename, _subaccount

But I am unable to get the value of the column GROUP.NOTES-ROW, because the following error occurs:

 val groupNoteDf=xmlContent.select("GROUP.NOTES-ROW").show()
org.apache.spark.sql.AnalysisException: cannot resolve '`GROUP.NOTES-ROW`' given input columns: [_exbatch, _account, _area, GROUP.NOTES-ROW, _subaccount, _mahptablename, _filename];;
'Project ['GROUP.NOTES-ROW]
+- Relation[GROUP.NOTES-ROW#0,_account#1,_area#2,_exbatch#3,_filename#4,_mahptablename#5,_subaccount#6] XmlRelation(<function0>,Some(/datalake/other/decomlake/spark-xml-poc/Sample.xml),Map(rowtag -> GROUP.NOTES, path -> /datalake/other/decomlake/spark-xml-poc/Sample.xml),null)

  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:89)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:86)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
  at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:268)
  at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:268)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:279)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:289)
  at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:293)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.AbstractTraversable.map(Traversable.scala:104)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:293)
  at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$6.apply(QueryPlan.scala:298)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:298)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:268)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:86)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:79)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:79)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)

I am using the Databricks spark-xml API to parse and load the XML file. Thanks in advance to anyone who can help me find a solution.

1 answer:

Answer 0 (score: 0)

You need backticks (`) to escape the hyphen (-) in your column name.

xmlContent.select("`GROUP.NOTES-ROW`").show()

Also, show() returns Unit, so don't assign its result to a variable. You can use the statement above to view your DataFrame directly. If you want to create a new DataFrame by assigning the selection to a variable, don't call show() as part of that assignment.
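The reason backticks are needed: Spark's SQL parser treats an unquoted dot as struct-field access and a hyphen as subtraction, so a name containing either must be quoted as a single identifier. A minimal, Spark-free sketch of that rule (escapeColName is a hypothetical helper for illustration, not a Spark API):

```scala
// Hypothetical helper illustrating the quoting rule: wrap a column name in
// backticks when it contains characters the SQL parser would otherwise
// interpret specially ('.' as struct access, '-' as subtraction).
def escapeColName(name: String): String =
  if (name.exists(c => c == '.' || c == '-')) s"`$name`" else name

println(escapeColName("GROUP.NOTES-ROW")) // prints `GROUP.NOTES-ROW`
println(escapeColName("_account"))        // prints _account
```

With a real SparkSession, the escaped name is what you would pass to select, e.g. `xmlContent.select(escapeColName("GROUP.NOTES-ROW"))`.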