Spark error: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE

Date: 2018-04-13 06:09:13

Tags: scala apache-spark machine-learning

I am trying to count the number of negative samples, as follows:

val numNegatives = dataSet.filter(col("label") < 0.5).count

But I get a "Size exceeds Integer.MAX_VALUE" error:

java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
    at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:869)
    at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:127)
    at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:115)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1239)
    at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:129)
    at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:136)
    at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:512)
    at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:427)
    at org.apache.spark.storage.BlockManager.get(BlockManager.scala:636)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Some solutions suggest increasing the number of partitions, so I updated the code above as follows:

val data = dataSet.repartition(5000).cache()
val numNegatives = data.filter(col("label") < 0.5).count

But it reports the same error! This has puzzled me for days. Can anyone help? Thanks.

2 Answers:

Answer 0 (score: 0)

Try repartitioning before filtering:

val numNegatives = dataSet.repartition(1000).filter(col("label") < 0.5).count

The filter is executed with the original DataSet's partitions, and only its result is repartitioned. You need to run the filter over smaller partitions.
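
A quick way to verify this (a minimal sketch, assuming the DataFrame is still named dataSet) is to print the partition count before and after the repartition, so you can confirm the filter actually runs over the smaller partitions:

// Partitions the filter would otherwise run with:
println(s"before: ${dataSet.rdd.getNumPartitions}")

// Partitions after splitting into smaller chunks:
val repartitioned = dataSet.repartition(1000)
println(s"after: ${repartitioned.rdd.getNumPartitions}")

val numNegatives = repartitioned.filter(col("label") < 0.5).count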

Answer 1 (score: 0)

The problem is that the ShuffleRDD, once materialized, has a block larger than 2 GB, and Spark has this limitation. You need to change the spark.sql.shuffle.partitions parameter, which defaults to 200.
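
For reference (a sketch, assuming a Spark 2.x SparkSession bound to the name spark), the same parameter can also be set through the configuration API or at submit time, instead of the SQL SET statement used below:

// Equivalent to running "SET spark.sql.shuffle.partitions = 10000" in SQL:
spark.conf.set("spark.sql.shuffle.partitions", "10000")

// Or when submitting the job:
// spark-submit --conf spark.sql.shuffle.partitions=10000 ...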

In addition, you may need to increase the number of partitions your dataset has: repartition it and save it, then read the new dataset back and run your operation.

spark.sql("SET spark.sql.shuffle.partitions = 10000")
dataset.repartition(10000).write.parquet("/path/to/hdfs")
val newDataset = spark.read.parquet("/path/to/hdfs")  
newDataset.filter(...).count

Or, if you prefer to use a Hive table:

spark.sql("SET spark.sql.shuffle.partitions = 10000")
dataset.repartition(10000).write.saveAsTable("newTableName")
val newDataset = spark.table("newTableName")  
newDataset.filter(...).count
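
As a rough sanity check either way (a sketch that counts rows per partition, not bytes, and assumes the newDataset name used above), you can confirm the re-read data is spread across many small partitions before running the count:

// Rows per partition of the re-read dataset; with 10000 partitions,
// each should be far too small for any single block to approach the 2 GB limit.
val rowsPerPartition = newDataset.rdd
  .mapPartitions(it => Iterator(it.size))
  .collect()
println(s"max rows in one partition: ${rowsPerPartition.max}")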