Path to nested columns not found in Spark DataFrame

Date: 2016-11-05 11:04:49

Tags: apache-spark xml-parsing apache-spark-sql spark-dataframe

I have run into a problem and I am not sure whether the issue is with Spark DataFrames or with spark-xml, which I use to parse an XML file into Spark. I would really appreciate any help.

So, I have the following XML:

<root>
  <path>
    <to>
      <atag>
        <atag_number>1</atag_number>
        <more>
          <again>
            <text>1111</text>
          </again>
        </more>
        <more>
          <again>
            <text>2222</text>
          </again>
        </more>
        <more>
          <again>
            <text>3333</text>
          </again>
        </more>
      </atag>
      <atag>
        <atag_number>2</atag_number>
        <more>
          <again>
            <text>4444</text>
          </again>
        </more>
        <more>
          <again>
            <text>5555</text>
          </again>
        </more>
        <more>
          <again>
            <text>6666</text>
          </again>
        </more>
      </atag>
    </to>
  </path>
</root>
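
For reference, I load the file with spark-xml along these lines (a rough sketch only - the rowTag value and the file name here are just placeholders, not my exact setup):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read
      .format("com.databricks.spark.xml")   # spark-xml data source
      .option("rowTag", "root")             # placeholder rowTag
      .load("myfile.xml"))                  # placeholder path

df.printSchema()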

I would like to get a table containing path.to.atag.more.again.text. I want the values to be atomic, so an explode is needed to end up with one row per text value.

If I select, for example, path.to.atag[0].more.again.text, I get a list ['1111', '2222', '3333']. But if I want all the tags in the file and select path.to.atag.more.again.text, I get an error saying:

Traceback (most recent call last):
  File "...\spark-2.0.1-bin-hadoop2.7\python\pyspark\sql\utils.py", line 63, in deco
    return f(*a, **kw)
  File "...\spark-2.0.1-bin-hadoop2.7\python\lib\py4j-0.10.3-src.zip\py4j\protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o34.selectExpr.
: org.apache.spark.sql.AnalysisException: No such struct field text in again; line 1 pos 0

    at org.apache.spark.sql.catalyst.expressions.ExtractValue$.findField(complexTypeExtractors.scala:85)
    at org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:58)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$3.apply(LogicalPlan.scala:253)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$3.apply(LogicalPlan.scala:252)
    at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
    at scala.collection.immutable.List.foldLeft(List.scala:84)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:252)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:148)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9$$anonfun$applyOrElse$5$$anonfun$31.apply(Analyzer.scala:604)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9$$anonfun$applyOrElse$5$$anonfun$31.apply(Analyzer.scala:604)
    at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:604)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:600)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:191)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:201)
    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:205)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:205)
    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$5.apply(QueryPlan.scala:210)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:210)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9.applyOrElse(Analyzer.scala:600)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9.applyOrElse(Analyzer.scala:542)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:542)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:479)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
    at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
    at scala.collection.immutable.List.foldLeft(List.scala:84)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
    at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:65)
    at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:63)
    at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:51)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
    at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2603)
    at org.apache.spark.sql.Dataset.select(Dataset.scala:969)
    at org.apache.spark.sql.Dataset.selectExpr(Dataset.scala:1004)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Unknown Source)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "...\MyModule.py", line 67, in <module>
    df_output = df.selectExpr('path.to.atag.more.again.text')
  File "...\spark-2.0.1-bin-hadoop2.7\python\pyspark\sql\dataframe.py", line 875, in selectExpr
    jdf = self._jdf.selectExpr(self._jseq(expr))
  File "...\spark-2.0.1-bin-hadoop2.7\python\lib\py4j-0.10.3-src.zip\py4j\java_gateway.py", line 1133, in __call__
  File "...\spark-2.0.1-bin-hadoop2.7\python\pyspark\sql\utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'No such struct field text in again; line 1 pos 0'
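
To summarize the two calls (a simplified sketch of what I run, with df being the DataFrame loaded above):

# indexing into the array works and returns the list of text values for that atag
df.selectExpr('path.to.atag[0].more.again.text').show()   # ['1111', '2222', '3333']

# the same path without the index raises the AnalysisException shown above
df.selectExpr('path.to.atag.more.again.text')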

1 Answer:

Answer 0 (score: 1)

You will also have to explode atag, for example:

atags = df.select(explode(df.path.to.atag))
atags.select(explode(atags.col.more.again.text))

The snippet above will give you a DataFrame with 6 rows - one for each text tag.
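
For completeness, a slightly fuller sketch of the same two explodes, including the import that explode needs (the aliases here are just illustrative names; by default explode produces a column called col, which is what the snippet above relies on):

from pyspark.sql.functions import explode

# one row per <atag> struct
atags = df.select(explode(df.path.to.atag).alias("atag"))

# atag.more.again.text is an array of text values (one entry per <more>),
# so a second explode yields one row per <text> value
texts = atags.select(explode(atags.atag.more.again.text).alias("text"))

texts.show()  # 1111, 2222, 3333, 4444, 5555, 6666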

EDIT

If your XML files each have a different schema, then Spark DataFrames are not the best solution (DataFrames are designed to work with files that share the same schema). If you are looking for specific tags inside the files, you can try the pure RDD API and parse the files with a DOM parser.
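
A rough illustration of that RDD + DOM idea (just a sketch - it assumes each XML file is small enough to parse in memory, and the path pattern below is only an example):

import xml.dom.minidom

def extract_texts(name_and_content):
    _, content = name_and_content
    dom = xml.dom.minidom.parseString(content)
    # collect every <text> element, wherever it is nested
    return [node.firstChild.data for node in dom.getElementsByTagName("text")]

# sc is the SparkContext (spark.sparkContext in Spark 2.x);
# wholeTextFiles yields (filename, full file content) pairs, one per XML file
texts = sc.wholeTextFiles("path/to/xml/files/*.xml").flatMap(extract_texts)
print(texts.collect())  # e.g. ['1111', '2222', ..., '6666']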