Reading nested JSON via Spark SQL - [AnalysisException] cannot resolve column

Date: 2017-05-06 06:32:13

Tags: json scala apache-spark apache-spark-sql

I have JSON data like this:

{  
   "parent":[  
      {  
         "prop1":1.0,
         "prop2":"C",
         "children":[  
            {  
               "child_prop1":[  
                  "3026"
               ]
            }
         ]
      }
   ]
}

After reading the data with Spark, I get the following schema:

val df = spark.read.json("test.json")
df.printSchema
root
 |-- parent: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- children: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- child_prop1: array (nullable = true)
 |    |    |    |    |    |-- element: string (containsNull = true)
 |    |    |-- prop1: double (nullable = true)
 |    |    |-- prop2: string (nullable = true)

Now, I want to select child_prop1 from df. But when I try to select it, I get an org.apache.spark.sql.AnalysisException, like this:

df.select("parent.children.child_prop1")
org.apache.spark.sql.AnalysisException: cannot resolve '`parent`.`children`['child_prop1']' due to data type mismatch: argument 2 requires integral type, however, ''child_prop1'' is of string type.;;
'Project [parent#60.children[child_prop1] AS child_prop1#63]
+- Relation[parent#60] json

  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:82)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:331)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:329)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:282)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:292)
  at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:296)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.AbstractTraversable.map(Traversable.scala:104)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:296)
  at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$7.apply(QueryPlan.scala:301)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:301)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57)
  at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:48)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2822)
  at org.apache.spark.sql.Dataset.select(Dataset.scala:1121)
  at org.apache.spark.sql.Dataset.select(Dataset.scala:1139)
  ... 48 elided

However, when I select only children from df, it works fine:

df.select("parent.children").show(false)
+------------------------------------+
|children                            |
+------------------------------------+
|[WrappedArray([WrappedArray(3026)])]|
+------------------------------------+

I can't understand why it throws an exception even though the column exists in the DataFrame.

Any help is appreciated!

2 answers:

Answer 0 (score: 3)

Your JSON is valid; I don't think you need to change the input data.

Use explode to get at the data:

import org.apache.spark.sql.functions.explode

val data = spark.read.json("src/test/java/data.json")
// First explode: one row per element of the outer parent.children array
val child = data.select(explode(data("parent.children"))).toDF("children")
// Second explode: descend into the inner array of structs to reach child_prop1
child.select(explode(child("children.child_prop1"))).toDF("child_prop1").show()
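The error's "argument 2 requires integral type" hints at an alternative: extracting a value from inside an array needs an integer position, not a field name. As a sketch (untested; assumes the schema above and that you only want the first element at each array level), you can index into both array levels explicitly instead of exploding:

```scala
import org.apache.spark.sql.functions.col

// getItem(0) picks the first element of each array level; getField then
// descends into the struct. Passing a field name where an index is
// required is what triggers the AnalysisException in the question.
val firstChildProp = data.select(
  col("parent").getItem(0)           // first struct in the parent array
    .getField("children").getItem(0) // first struct in the children array
    .getField("child_prop1")         // the array<string> value
    .alias("child_prop1")
)
firstChildProp.show(false)
```

This only reads one element per level, so explode remains the right tool when the arrays can hold more than one entry.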

If you can change the input data, you can follow @ramesh's suggestion.

Answer 1 (score: 1)

If you look at the schema, child_prop1 sits inside a nested array within the root array parent. So you would need to give the position of child_prop1 inside that nested array, which is what the error message is asking you to define. Converting the JSON format should solve the issue.

Change the JSON to

{"parent":{"prop1":1.0,"prop2":"C","children":{"child_prop1":["3026"]}}}

and apply

df.select("parent.children.child_prop1").show(false)

which will output

+-----------+
|child_prop1|
+-----------+
|[3026]     |
+-----------+

I hope the answer helps.
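To experiment with the restructured single-struct JSON without writing a file, here is a sketch (assumes Spark 2.2+, where spark.read.json accepts a Dataset[String]; on older versions pass an RDD[String] via sc.parallelize instead):

```scala
import spark.implicits._

val json = """{"parent":{"prop1":1.0,"prop2":"C","children":{"child_prop1":["3026"]}}}"""
// With parent and children as structs rather than arrays, the dotted
// path resolves without any array indices.
val df2 = spark.read.json(Seq(json).toDS)
df2.select("parent.children.child_prop1").show(false)
```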