I am doing the following. I fetch data from HBase and Neo4j into dataframes and verify them with:
df.show()
Then I join the two dataframes using Spark SQL on the unique id column present in both:
Dataset<Row> mergedData = ss.sql("SELECT * from hbasetable, neo4jtable WHERE hbasetable.nodeId = neo4jtable.id");
mergedData.show()
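For context, here is a minimal sketch of the setup the query above implies, in the Java API used in the question: the two dataframes are registered as temp views named hbasetable and neo4jtable (the view names are taken from the SQL; loadFromHbase and loadFromNeo4j are hypothetical placeholders, since the loading code is not shown in the question).

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession ss = SparkSession.builder().appName("join-example").getOrCreate();

// Hypothetical helpers standing in for the actual HBase / Neo4j loading code.
Dataset<Row> hbaseDf = loadFromHbase(ss);
Dataset<Row> neo4jDf = loadFromNeo4j(ss);

// Register both dataframes as temp views so Spark SQL can reference them by name.
hbaseDf.createOrReplaceTempView("hbasetable");
neo4jDf.createOrReplaceTempView("neo4jtable");

// Join on the unique id column present in both tables.
Dataset<Row> mergedData = ss.sql(
    "SELECT * FROM hbasetable, neo4jtable WHERE hbasetable.nodeId = neo4jtable.id");
mergedData.show();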
This worked perfectly fine. However, I have now changed the Cypher query used to fetch the Neo4j data. Earlier my Cypher query was like this:
Match (n:Type1 {caption:'type1caption'})-[:contains]->(m:Type2) return m.attr1, m.attr2, m.attr3, m.attr4, m.attr5, m.attr6, m.attr7, m.attr8, m.id as id, m.attr9, m.attr10, m.attr11
Now it looks like this:
Match (m:Type1) return m.attr1, m.attr2, m.attr3, m.attr4, m.attr5, m.attr6, m.attr7, m.attr8, m.id as id, m.attr9, m.attr10, m.attr11
But now the join fails with the following exception:
Long is not a valid external type for schema of string
The contents of both the new Neo4j dataframe and the HBase dataframe appear to be fetched correctly, since neo4jdf.show() and hbasedf.show() both display data on the console. I am wondering why the join fails when the earlier join worked fine and the data is still being fetched correctly.
My main concern is that I cannot interpret the stack trace printed on the console. It looks like this:
18/06/04 17:47:48 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4)
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.lang.Long is not a valid external type for schema of string
if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, m.attr1), NullType) AS m.attr1#0
+- if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, m.attr1), NullType)
:- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt
: :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object)
: : +- input[0, org.apache.spark.sql.Row, true]
: +- 0
:- null
+- validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, m.attr1), NullType)
+- getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, m.attr1)
+- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object)
+- input[0, org.apache.spark.sql.Row, true]
:
:
lot of stack trace omitted
:
:
if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 11, m.attr11), StringType), true) AS m.attr11#11
+- if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 11, m.attr11), StringType), true)
:- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt
: :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object)
: : +- input[0, org.apache.spark.sql.Row, true]
: +- 11
:- null
+- staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 11, m.attr11), StringType), true)
+- validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 11, m.attr11), StringType)
+- getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 11, m.attr11)
+- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object)
+- input[0, org.apache.spark.sql.Row, true]
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:279)
at org.apache.spark.sql.SparkSession$$anonfun$5.apply(SparkSession.scala:537)
at org.apache.spark.sql.SparkSession$$anonfun$5.apply(SparkSession.scala:537)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: java.lang.Long is not a valid external type for schema of string
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:276)
... 16 more
The full stack trace can be found in a gist here.
I feel this must be related to some inconsistency in the Neo4j data. If I could find out the id of the Neo4j node for which the exception occurs, I could inspect it. But I simply cannot interpret this kind of stack trace. While performing the join, is it possible to know which record of the dataframe failed to be prepared?
Update
I removed everything related to the join and HBase and added neo4jdf.show(24000, false);, and it gives the same error as above. There are 23748 records. When I print a small number of records (say neo4jdf.show(1000)), they are printed without any error. But when I let it print 24000 records, it fails. That means something is wrong with some node. But how can I pinpoint it?
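One way to narrow this down, sketched below as an assumption rather than something from the original post: since show(1000) succeeds and show(24000, false) fails, show(n) appears to materialize only the first n rows, and the encoding failure reaches the driver as an exception, so you can bisect on the row count. neo4jdf is the dataframe from the question.

// Bisection over the row count: `lo` rows are known to encode, `hi` rows are known to fail.
int lo = 1000;      // neo4jdf.show(1000) worked
int hi = 24000;     // neo4jdf.show(24000, false) failed
while (lo + 1 < hi) {
    int mid = (lo + hi) / 2;
    try {
        neo4jdf.show(mid, false);
        lo = mid;   // the first `mid` rows encode fine
    } catch (Exception e) {
        hi = mid;   // the failing row is at position <= mid
    }
}
System.out.println("First failing record is around row " + hi);
// Looking up that row's id in Neo4j should reveal the node whose property is
// stored as a Long while the other nodes store a string.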
Answer 0 (score: 0)
The root of the problem is heterogeneous data in the source. Spark SQL uses a relational model and does not allow heterogeneous columns.
If you cannot guarantee that all records have the correct structure, I would suggest fetching everything as strings:
return toString(m.attr1), toString(m.attr2), ..., toString(m.attr11)
and then casting to the required types with standard Spark operators:
df.select($"attr1".cast(...), ...)