Spark Dataset shows its schema, but the show() method throws an UnsupportedOperationException

Asked: 2018-02-21 16:13:20

Tags: apache-spark apache-spark-dataset

I created a Spark Dataset using the bean encoder of a custom Java class and hit the exception below:
java.lang.UnsupportedOperationException: Cannot grow BufferHolder by size 0 because the size after growing exceeds size limitation 2147483647
at org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:65)
at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:214)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply2_2$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.execution.LocalTableScanExec$$anonfun$unsafeRows$1.apply(LocalTableScanExec.scala:41)
at org.apache.spark.sql.execution.LocalTableScanExec$$anonfun$unsafeRows$1.apply(LocalTableScanExec.scala:41)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.sql.execution.LocalTableScanExec.unsafeRows$lzycompute(LocalTableScanExec.scala:41)
at org.apache.spark.sql.execution.LocalTableScanExec.unsafeRows(LocalTableScanExec.scala:36)
at org.apache.spark.sql.execution.LocalTableScanExec.executeTake(LocalTableScanExec.scala:72)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2861)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2150)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2150)
at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2842)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2841)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2150)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2363)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:241)
at org.apache.spark.sql.Dataset.show(Dataset.scala:637)
at org.apache.spark.sql.Dataset.show(Dataset.scala:596)
at org.apache.spark.sql.Dataset.show(Dataset.scala:605)

customJavaTypeDataset.printSchema() works fine and displays the schema correctly. However, customJavaTypeDataset.show() throws the exception shown above.
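As a minimal sketch of the setup being described (the actual CustomJavaType bean and its contents are not shown in the question, so the construction here is assumed), the Dataset is built from a list of beans with Encoders.bean and then printed:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class BeanEncoderSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("bean-encoder-sketch")
                .master("local[*]")
                .getOrCreate();

        // The question mentions a list of 5 objects; their contents are assumed.
        List<CustomJavaType> objects = Arrays.asList(
                new CustomJavaType(), new CustomJavaType(), new CustomJavaType(),
                new CustomJavaType(), new CustomJavaType());

        Dataset<CustomJavaType> customJavaTypeDataset =
                spark.createDataset(objects, Encoders.bean(CustomJavaType.class));

        customJavaTypeDataset.printSchema(); // succeeds: the schema prints correctly
        customJavaTypeDataset.show();        // throws the UnsupportedOperationException above
    }
}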


All nested classes of CustomJavaType implement Serializable. The list contains 5 objects. The printSchema() output is as expected.
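For reference, a hypothetical shape for CustomJavaType matching that description (a bean whose nested classes implement Serializable; the actual fields are not given in the question, so all of them are assumptions):

import java.io.Serializable;

public class CustomJavaType implements Serializable {

    // Hypothetical nested type; the question only says that all nested
    // classes implement Serializable.
    public static class Inner implements Serializable {
        private String value;

        public String getValue() { return value; }
        public void setValue(String value) { this.value = value; }
    }

    private String name;   // assumed field
    private Inner detail;  // assumed field

    // Bean-style getters and setters are what Encoders.bean reflects over.
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public Inner getDetail() { return detail; }
    public void setDetail(Inner detail) { this.detail = detail; }
}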

1 answer:

Answer 0 (score: 0)

This isn't really a solution to the problem (see the comments above), but it may help get someone closer...

I believe I have tracked down the point in the code that triggers this error. It is in spark-catalyst_2.11-2.2.0:/.../org/apache/spark/sql/catalyst/expressions/UnsafeRow.java, in getUTF8String at line 418. On that line a "long" is cast to an "int", but the value is too large for an int, so it wraps around to a negative number, which is then used when trying to grow the byte buffer (somewhere along the way a java.lang.NegativeArraySizeException is thrown and swallowed/ignored).
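A minimal standalone illustration of that wraparound (the packed (offset, size) long layout here is an assumption about UnsafeRow's internal format, not a quote of the Spark source):

public class NarrowingCastSketch {
    public static void main(String[] args) {
        // Assume a packed (offset, size) long whose low 32 bits exceed
        // Integer.MAX_VALUE when read back as a signed int.
        long offsetAndSize = (1L << 32) | 3_000_000_000L;

        int offset = (int) (offsetAndSize >> 32); // 1, as expected
        int size = (int) offsetAndSize;           // wraps to -1294967296

        System.out.println("offset=" + offset + ", size=" + size);
        // A negative size like this is what later produces a
        // NegativeArraySizeException when used to allocate a buffer.
    }
}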

Eventually we reach spark-catalyst_2.11-2.2.0:/.../org/apache/spark/sql/catalyst/expressions/codegen/BufferHolder.java, in grow at line 64, where the if() statement mistakes the negative value for an overly large one and therefore throws the UnsupportedOperationException.
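To see how a corrupted, negative running size can make a guard of that shape fire even for a request of 0 bytes (matching the "Cannot grow BufferHolder by size 0" message), here is a sketch modeled on the check; the corrupted value is an assumption, and this is not the actual Spark source:

public class BufferHolderGrowSketch {
    // Stand-in for the holder's current total size; assumed to have gone
    // negative because of the upstream long-to-int wraparound.
    static int totalSize = Integer.MIN_VALUE + 10;

    static void grow(int neededSize) {
        // Integer.MAX_VALUE - totalSize itself overflows to a negative int
        // (-11 here), so even neededSize == 0 "exceeds" it and the
        // exception is thrown.
        if (neededSize > Integer.MAX_VALUE - totalSize) {
            throw new UnsupportedOperationException(
                "Cannot grow BufferHolder by size " + neededSize +
                " because the size after growing exceeds size limitation " +
                Integer.MAX_VALUE);
        }
    }

    public static void main(String[] args) {
        grow(0); // throws, reproducing the "by size 0" wording
    }
}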

I don't know what to do with this information; maybe someone else does. Is this the kind of thing that should be reported as a bug?

Here are some visuals from my debugger showing the details:

[Screenshot: getUTF8String__spark-catalyst_2.11-2.2.0__org.apache.spark.sql.catalyst.expressions.UnsafeRow.png]

[Screenshot: grow__spark-catalyst_2.11-2.2.0__org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.png]