Join of two DataFrames was working, but now it fails after I changed the contents of one of them

Date: 2018-06-04 12:59:46

Tags: apache-spark dataframe neo4j

I am doing the following:

  • Fetching a DataFrame from Neo4j using the neo4j-spark-connector
  • Fetching a DataFrame from HBase using the Apache HBase Spark connector
  • Printing both DataFrames to the console with df.show()
  • Joining the two DataFrames with Spark SQL on a unique id column present in both (a sketch of this step follows the list):

    Dataset<Row> mergedData = ss.sql("SELECT * from hbasetable, neo4jtable WHERE hbasetable.nodeId = neo4jtable.id");

  • Printing the joined DataFrame to the console via mergedData.show()
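For reference, here is a minimal sketch of the join step in Java. It assumes neo4jdf, hbasedf, and ss are the Neo4j DataFrame, HBase DataFrame, and SparkSession referenced later in this post; the temp-view registration calls are my assumption and are not shown in the original.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Sketch only: neo4jdf and hbasedf are assumed to have been loaded already via the
    // neo4j-spark-connector and the HBase Spark connector respectively.
    neo4jdf.createOrReplaceTempView("neo4jtable");
    hbasedf.createOrReplaceTempView("hbasetable");

    Dataset<Row> mergedData = ss.sql(
            "SELECT * from hbasetable, neo4jtable WHERE hbasetable.nodeId = neo4jtable.id");
    mergedData.show();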

This was working perfectly fine. But now I have changed the Cypher query used to fetch the Neo4j data. Earlier my Cypher query looked like this:

Match (n:Type1 {caption:'type1caption"'})-[:contains]->(m:Type2) return m.attr1, m.attr2, m.attr3, m.attr4, m.attr5, m.attr6, m.attr7, m.attr8, m.id as id, m.attr9, m.attr10, m.attr11

Now it looks like this:

Match (m:Type1) return m.attr1, m.attr2, m.attr3, m.attr4, m.attr5, m.attr6, m.attr7, m.attr8, m.id as id, m.attr9, m.attr10, m.attr11

But now the join fails. It gives me the following exception:

Long is not a valid external type for schema of string

It seems that the contents of both the new Neo4j DataFrame and the HBase DataFrame are fetched correctly, since both neo4jdf.show() and hbasedf.show() display data on the console. I am wondering why the join fails now, given that the earlier join worked fine and the data appears to be fetched correctly.
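A quick diagnostic at this point (my addition, assuming the variable names used above) is to print the schema Spark inferred for each DataFrame; show() on a prefix of the data can succeed even when a column typed as string holds Long values further down:

    // Compare the inferred column types with what the data actually contains; the error
    // "Long is not a valid external type for schema of string" points at such a mismatch.
    neo4jdf.printSchema();
    hbasedf.printSchema();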

My main concern is that I cannot interpret the stack trace printed on the console. It looks like this:

18/06/04 17:47:48 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4)
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.lang.Long is not a valid external type for schema of string
if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, m.attr1), NullType) AS m.attr1#0
+- if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, m.attr1), NullType)
   :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt
   :  :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object)
   :  :  +- input[0, org.apache.spark.sql.Row, true]
   :  +- 0
   :- null
   +- validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, m.attr1), NullType)
      +- getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, m.attr1)
         +- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object)
            +- input[0, org.apache.spark.sql.Row, true]

     :
     :
   lot of stack trace omitted
     :
     :
if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 11, m.attr11), StringType), true) AS m.attr11#11
+- if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 11, m.attr11), StringType), true)
   :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt
   :  :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object)
   :  :  +- input[0, org.apache.spark.sql.Row, true]
   :  +- 11
   :- null
   +- staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 11, m.attr11), StringType), true)
      +- validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 11, m.attr11), StringType)
         +- getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 11, m.attr11)
            +- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object)
               +- input[0, org.apache.spark.sql.Row, true]

    at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:279)
    at org.apache.spark.sql.SparkSession$$anonfun$5.apply(SparkSession.scala:537)
    at org.apache.spark.sql.SparkSession$$anonfun$5.apply(SparkSession.scala:537)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
    at org.apache.spark.scheduler.Task.run(Task.scala:85)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: java.lang.Long is not a valid external type for schema of string
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
    at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:276)
    ... 16 more

The full stack trace can be found in the gist here.

I feel this must be related to some inconsistency in the Neo4j data. If I could find out the id of the Neo4j node for which the exception occurs, I could inspect it. But I simply cannot interpret this kind of stack trace. Is it possible to know which record of the DataFrame failed to be prepared while performing the join?

Update

I removed everything related to the join and HBase and added neo4jdf.show(24000, false);, and it gives the same error as above. There are 23748 records. When I print a small number of records (say neo4jdf.show(1000)), it prints without any error. But when I let it print all 24000 records, it fails. This means something is wrong with some node. But how can I pinpoint it?
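One way to narrow this down (a sketch of my own, assuming the row order returned by the connector is stable across runs) is to bisect on the number of rows passed to show() until the smallest failing count is found; the offending record then sits near that offset:

    // show(1000) works and show(24000) fails, so search for the smallest count that fails.
    int good = 1000;    // known-good row count
    int bad = 24000;    // known-failing row count
    while (good + 1 < bad) {
        int mid = (good + bad) / 2;
        try {
            neo4jdf.show(mid, false);   // forces encoding of the first `mid` rows
            good = mid;                 // still fine up to `mid`
        } catch (Exception e) {
            bad = mid;                  // the bad record is at or before row `mid`
        }
    }
    System.out.println("First failing record is around row " + bad);

Once the approximate position is known, the ids near the end of the last successful neo4jdf.show(good, false) output give a starting point for checking the data in Neo4j.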

1 answer:

Answer 0: (score: 0)

The root of the problem is heterogeneous data in the source. Spark SQL uses a relational model and does not allow heterogeneous columns.
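A minimal reproduction of that failure mode (a hypothetical sketch, not taken from the answer; ss is the SparkSession): declare a string column and feed it a Long in one row, and the same exception appears.

    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;

    // One column declared as string, but the second row carries a Long.
    StructType schema = new StructType().add("attr1", DataTypes.StringType);
    List<Row> rows = Arrays.asList(
            RowFactory.create("ok"),
            RowFactory.create(42L));   // heterogeneous value

    Dataset<Row> df = ss.createDataFrame(rows, schema);
    df.show();   // fails with "java.lang.Long is not a valid external type for schema of string"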

If you cannot guarantee that all records have the right structure, I would suggest fetching everything as strings:

return toString(m.attr1), toString(m.attr2), ..., toString(m.attr11)

and then casting to the desired types using standard Spark operators:

df.select($"attr1".cast(...), ...)
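A concrete variant of that last step in Java, matching the question's code. This assumes the Cypher query aliases each property (e.g. toString(m.attr1) AS attr1) so the DataFrame columns are named attr1 ... attr11; the target types are placeholders.

    import static org.apache.spark.sql.functions.col;

    // Everything was fetched as a string, so cast each column back to its intended type here.
    Dataset<Row> typedDf = neo4jdf.select(
            col("id"),
            col("attr1").cast("string"),
            col("attr2").cast("long")    // replace "long" with whatever type attr2 should really be
            // ... the remaining attributes follow the same pattern
    );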