Finding a specific element in a nested XML file with Spark Scala

Date: 2017-07-05 14:41:13

Tags: xml scala apache-spark

I want to select a specific element: select("File.columns.column._name")

 |-- File: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _Description: string (nullable = true)
 |    |    |-- _RowTag: string (nullable = true)
 |    |    |-- _name: string (nullable = true)
 |    |    |-- _type: string (nullable = true)
 |    |    |-- columns: struct (nullable = true)
 |    |    |    |-- column: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- _Hive_Final_Table: string (nullable = true)
 |    |    |    |    |    |-- _Hive_Final_column: string (nullable = true)
 |    |    |    |    |    |-- _Hive_Table1: string (nullable = true)
 |    |    |    |    |    |-- _Hive_column1: string (nullable = true)
 |    |    |    |    |    |-- _Path: string (nullable = true)
 |    |    |    |    |    |-- _Type: string (nullable = true)
 |    |    |    |    |    |-- _VALUE: string (nullable = true)
 |    |    |    |    |    |-- _name: string (nullable = true)

I'm getting this error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'File.columns.column[_name]' due to data type mismatch: argument 2 requires integral type, however '_name' is of string type.;
        at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:65)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335)
        at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
        at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:334)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:332)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:332)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:281)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
        at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
        at scala.collection.AbstractIterator.to(Iterator.scala:1157)
        at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
        at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
        at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
        at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
        at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:321)
        at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:332)
        at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:108)
        at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:118)

Can you help me?

2 answers:

Answer 0: (score: 0)

You need the explode function to get at the column you want. Because column is an array nested inside the File array, plain dot notation cannot reach _name; as the error shows, Spark ends up treating _name as an index into the array.

explode(Column e)       Creates a new row for each element in the given array or map column.
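
As a quick illustration of how explode flattens an array column (a minimal sketch with made-up toy data, assuming a SparkSession named spark is in scope, as in spark-shell):

import org.apache.spark.sql.functions.explode
import spark.implicits._

// Toy data: one row carrying an array of three strings.
val toy = Seq((1, Seq("a", "b", "c"))).toDF("id", "letters")

// explode produces one output row per array element.
toy.select($"id", explode($"letters").as("letter")).show()
// +---+------+
// | id|letter|
// +---+------+
// |  1|     a|
// |  1|     b|
// |  1|     c|
// +---+------+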

val df1 = df.select(explode($"File").as("File")).select($"File.columns.column".as("column"))

The first explode gives you the column field.

val finalDF = df1.select(explode($"column").as("column")).select($"column._name".as("_name"))

The second explode gives you _name.
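
Putting the two steps together end to end (a sketch, assuming the XML is loaded with the spark-xml package; the rowTag and path are placeholders you would adapt to your file):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

val spark = SparkSession.builder().appName("nested-xml").getOrCreate()
import spark.implicits._

// Hypothetical read; adjust rowTag and path to your XML layout.
val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "Files")
  .load("/path/to/file.xml")

val names = df
  .select(explode($"File").as("File"))                  // one row per File element
  .select(explode($"File.columns.column").as("column")) // one row per column element
  .select($"column._name".as("_name"))                  // the attribute we want

names.show(false)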

Hope this helps!

Answer 1: (score: 0)

Looking at your schema, you can do the following to select _name from the nested structure of the dataframe:

import org.apache.spark.sql.functions._
// (0)(0) picks the first File element's first column struct, then reads its _name
df.select(col("File.columns.column")(0)(0)("_name").as("_name"))
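
The same expression can also be written with explicit getItem / getField calls, which some find easier to read (equivalent to the apply syntax above; just a sketch, not a required change):

import org.apache.spark.sql.functions.col

df.select(
  col("File.columns.column")
    .getItem(0)          // first element of the outer File array
    .getItem(0)          // first struct in its inner column array
    .getField("_name")
    .as("_name")
)

Note that either form returns only the first _name; if you want one row per column entry, use the explode approach from the other answer.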