I tried to count the number of negative samples as follows:
val numNegatives = dataSet.filter(col("label") < 0.5).count
But I got a "Size exceeds Integer.MAX_VALUE" error:
java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:869)
at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:127)
at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:115)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1239)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:129)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:136)
at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:512)
at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:427)
at org.apache.spark.storage.BlockManager.get(BlockManager.scala:636)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Some solutions suggest increasing the number of partitions, so I updated the code above as follows:
val data = dataSet.repartition(5000).cache()
val numNegatives = data.filter(col("label") < 0.5).count
But it reports the same error! This has confused me for days. Can anyone help? Thanks.
Answer 0 (score: 0)
Try repartitioning before filtering:
val numNegatives = dataSet.repartition(1000).filter(col("label") < 0.5).count
The filter runs on the original DataSet's partitions, and only the result is repartitioned. You need the filter itself to run over smaller partitions.
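If it is unclear how many partitions the filter is actually running on, here is a minimal sketch of a sanity check (assuming `dataSet` is the same DataFrame as in the question; 1000 is only a placeholder partition count to tune for your data volume):
// Minimal sketch: inspect the partitioning before and after repartitioning.
println(dataSet.rdd.getNumPartitions)           // partitions the filter would otherwise run on
val repartitioned = dataSet.repartition(1000)   // placeholder count; choose enough partitions to keep each block well under 2 GB
println(repartitioned.rdd.getNumPartitions)     // should now report 1000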
Answer 1 (score: 0)
The problem is that a ShuffleRDD block is larger than 2 GB once it is materialized. Spark has this limitation. You need to change the spark.sql.shuffle.partitions parameter, whose default is 200.
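For instance, assuming a Spark 2.x SparkSession named `spark` (an assumption; this answer's snippets use that API), the setting can also be changed programmatically before the shuffle runs:
// Assumption: `spark` is an existing Spark 2.x SparkSession; 10000 is only an example value.
spark.conf.set("spark.sql.shuffle.partitions", "10000")
println(spark.conf.get("spark.sql.shuffle.partitions"))  // verify the new value is in effect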
In addition, you may need to increase the number of partitions your dataset has: repartition it and save it, then read the new dataset back and run your action.
spark.sql("SET spark.sql.shuffle.partitions = 10000")
dataset.repartition(10000).write.parquet("/path/to/hdfs")
val newDataset = spark.read.parquet("/path/to/hdfs")
newDataset.filter(...).count
Or, if you want to use a Hive table:
spark.sql("SET spark.sql.shuffle.partitions = 10000")
dataset.repartition(10000).write.saveAsTable("newTableName")
val newDataset = spark.table("newTableName")
newDataset.filter(...).count
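For completeness, a minimal sketch of the final count on the reloaded dataset (the col helper and the label column are taken from the question; add the import if it is not already in scope):
import org.apache.spark.sql.functions.col

// The count now runs over the smaller partitions produced by the shuffle/read.
val numNegatives = newDataset.filter(col("label") < 0.5).count()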