StackOverflowError when calling RDD#toDS with a Java object instead of a Scala case class

Asked: 2016-10-21 15:42:41

Tags: scala apache-spark apache-spark-dataset

I am trying to use an existing domain object defined in a third-party library, namely HAPI-FHIR's Patient class, to create a strongly typed Spark DataSet[Patient]:

scala> val patients = sc.loadFromMongoDB(ReadConfig(Map("uri" -> "mongodb://mongodb/fhir.patients")))
patients: com.mongodb.spark.rdd.MongoRDD[org.bson.Document] = MongoRDD[0] at RDD at MongoRDD.scala:47

scala> val patientsDataSet = patients.toDS[Patient](classOf[Patient])

However, when I make the RDD#toDS call above, I get a very long StackOverflowError.

The full stack trace is here: https://gist.github.com/vratnagiri-veriskhealth/6dcec9dbc6f74308019ab16c8d278a9b

Given the complexity of the domain object mentioned above, I realize this may be a fool's errand. But since I am a Scala newbie, I want to make sure I am not missing some simple tweak that might get this working before I give up on the pursuit.
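One tweak I have seen suggested for deeply nested or recursive Java types is to skip bean-style schema inference altogether and use a Kryo encoder. I have not verified this against HAPI-FHIR; documentToPatient below is a hypothetical stand-in for whatever org.bson.Document-to-Patient mapping applies, and the sketch assumes the Spark 1.6 sqlContext available in the shell (spark.createDataset on 2.x):

import org.apache.spark.sql.{Dataset, Encoders}

// A Kryo encoder never walks the bean's (recursive) property graph, so
// JavaTypeInference is not involved; the resulting Dataset holds a single
// binary column rather than one column per bean property.
implicit val patientEncoder = Encoders.kryo(classOf[Patient])

// Hypothetical mapping from the raw BSON document to a HAPI-FHIR Patient.
val patientRdd = patients.map(doc => documentToPatient(doc))

val patientsDataSet: Dataset[Patient] = sqlContext.createDataset(patientRdd)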

Here is part of the stack trace:

java.lang.StackOverflowError
  at org.spark-project.guava.collect.ImmutableCollection.<init>(ImmutableCollection.java:48)
  at org.spark-project.guava.collect.ImmutableSet.<init>(ImmutableSet.java:396)
  at org.spark-project.guava.collect.ImmutableMapEntrySet.<init>(ImmutableMapEntrySet.java:35)
  at org.spark-project.guava.collect.RegularImmutableMap$EntrySet.<init>(RegularImmutableMap.java:174)
  at org.spark-project.guava.collect.RegularImmutableMap$EntrySet.<init>(RegularImmutableMap.java:174)
  at org.spark-project.guava.collect.RegularImmutableMap.createEntrySet(RegularImmutableMap.java:170)
  at org.spark-project.guava.collect.ImmutableMap.entrySet(ImmutableMap.java:385)
  at org.spark-project.guava.collect.ImmutableMap.entrySet(ImmutableMap.java:61)
  at org.spark-project.guava.reflect.TypeResolver.where(TypeResolver.java:97)
  at org.spark-project.guava.reflect.TypeResolver.accordingTo(TypeResolver.java:65)
  at org.spark-project.guava.reflect.TypeToken.resolveType(TypeToken.java:266)
  at org.spark-project.guava.reflect.TypeToken$1.getGenericReturnType(TypeToken.java:469)
  at org.spark-project.guava.reflect.Invokable.getReturnType(Invokable.java:109)
  at org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:110)
  at org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:109)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
  at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
  at org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:109)
  at org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:95)
  at org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:111)
  at org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:109)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)

Thanks!

1 Answer:

Answer 0 (score: 0)

Have you tried printing the schema before and after converting the RDD to a Dataset? Compare the two and make sure they agree in the number of fields and in their respective data types. The schema printed before the conversion must match the one printed after.
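For example, a minimal sketch of that comparison (Encoders.bean goes through the same JavaTypeInference path that appears in the stack trace, so for a recursive bean like Patient the inference itself may overflow before anything prints):

import org.apache.spark.sql.Encoders

// Schema the bean encoder would infer for the Java class.
Encoders.bean(classOf[Patient]).schema.printTreeString()

// Schema actually carried by the Dataset after the conversion.
patientsDataSet.printSchema()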