I am trying to convert a JavaRDD<Row> back into a Dataset:
Dataset<Row> dataset = ...; // some existing dataset
JavaRDD<Row> rdd = dataset.toJavaRDD().mapToPair(...).mapPartitions(...);
rdd.take(30).forEach(System.out::println); // prints the rows successfully
Dataset<Row> result = spark.createDataset(rdd.rdd(), Encoders.bean(Row.class)); // <-- fails here
There is nothing special inside the map functions, and printing the RDD works fine, but when I try to convert it back into a Dataset I get the following error:
Exception in thread "main" java.lang.NullPointerException
at org.spark_project.guava.reflect.TypeToken.method(TypeToken.java:465)
at org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:126)
at org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125)
at org.apache.spark.sql.catalyst.JavaTypeInference$.inferDataType(JavaTypeInference.scala:55)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:89)
at org.apache.spark.sql.Encoders$.bean(Encoders.scala:142)
at org.apache.spark.sql.Encoders.bean(Encoders.scala)
at io.arlas.main.CSVReaderApp.split_large_gaps_sequences(CSVReaderApp.java:158)
at io.arlas.main.CSVReaderApp.main(CSVReaderApp.java:100)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I don't want to create a custom bean class and use it every time I have to run some map function; that does not seem like a practical solution.
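For context, this is roughly the boilerplate I would rather not repeat after every transformation. It is only a sketch: the TrackPoint class, its fields, and the column positions are made-up names for illustration, not from my real code.

import java.io.Serializable;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// Hypothetical bean just to show the boilerplate: Encoders.bean() expects a
// no-arg constructor plus getters/setters for every field.
public class TrackPoint implements Serializable {
    private String id;
    private double value;

    public TrackPoint() {}
    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public double getValue() { return value; }
    public void setValue(double value) { this.value = value; }
}

// ...and after each map/mapPartitions step, copy every Row into the bean:
JavaRDD<TrackPoint> beanRdd = rdd.map(row -> {
    TrackPoint p = new TrackPoint();
    p.setId(row.getString(0));     // column positions are illustrative
    p.setValue(row.getDouble(1));
    return p;
});
Dataset<TrackPoint> result = spark.createDataset(beanRdd.rdd(), Encoders.bean(TrackPoint.class));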
Any ideas on how to fix this?