Serialized task exceeds max allowed, Spark - Cluster

Date: 2017-01-06 23:21:11

Tags: apache-spark rdd

So I'm using Java to deploy Spark, and this is my original code:

List<Float> data = some_data;
JavaRDD<Float> dataAsRDD = javaSparkContext.parallelize(data);
JavaRDD<Float> dataWithoutNaN = dataAsRDD.filter(number -> !number.isNaN());
JavaDoubleRDD dataAsDouble = dataWithoutNaN.mapToDouble(number -> (double) number);
logger.info("\t\t\tMean: " + dataAsDouble.mean());

This runs in apache-spark standalone mode (with a warning), but in cluster mode it fails with the following error (line 86 is the dataAsDouble.mean() call):

17/01/06 17:54:24 INFO DAGScheduler: Job 0 failed: mean at Cluster.java:86, took 12.678086 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Serialized task 0:0 was 45337325 bytes, which exceeds max allowed: spark.akka.frameSize (10485760 bytes) - reserved (204800 bytes). Consider increasing spark.akka.frameSize or using broadcast variables for large values.

Following the instructions suggested for this exception in Exceeding spark.akka.frameSize when saving Word2VecModel and Spark broadcast error: exceeds spark.akka.frameSize Consider using broadcast, I used a broadcast variable:

List<Float> dataAsList = some_data;
Broadcast<List<Float>> broadcast = javaSparkContext.broadcast(dataAsList);
JavaRDD<Float> dataAsRDD = javaSparkContext.parallelize(broadcast.value());
JavaRDD<Float> dataWithoutNaN = dataAsRDD.filter(number -> !number.isNaN());
JavaDoubleRDD dataAsDouble = dataWithoutNaN.mapToDouble(number -> (double) number);
logger.info("\t\t\tMean: " + dataAsDouble.mean());

But I keep getting the same error. What am I doing wrong?

Thanks in advance!

1 Answer:

Answer 0 (score: 0)

You're using the broadcast variable the wrong way. A broadcast should not be immediately turned back into an RDD (as in javaSparkContext.parallelize(broadcast.value())); it should be passed into operations on an existing RDD and read there (see the sketch at the end of this answer). Since you're only parallelizing a data structure in order to compute its mean (I assume you're just experimenting with Spark), a broadcast can't solve this: the data still has to be serialized into the tasks to build the RDD in the first place. As the error message suggests, there is another solution: increase spark.akka.frameSize. You can do that by passing a flag to spark-shell / spark-submit:

--conf spark.akka.frameSize=64
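
If you'd rather set it in code, the same property can be set on the SparkConf before the JavaSparkContext is created, which should be equivalent. A minimal sketch; the app name "Cluster" is taken from your stack trace, everything else is standard:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Configure the frame size before the context is created.
SparkConf conf = new SparkConf()
        .setAppName("Cluster")
        .set("spark.akka.frameSize", "64"); // value is in MB
JavaSparkContext javaSparkContext = new JavaSparkContext(conf);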

Either way, this raises the frame size to 64 MB, comfortably above the roughly 43 MB serialized task reported in your error message.
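
As for the correct use of a broadcast variable (when you actually need one): the broadcast handle is captured by the closure of an RDD operation and read on the executors via value(), rather than being fed back into parallelize(). A minimal sketch, reusing dataAsRDD from your question; the threshold value here is hypothetical:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.broadcast.Broadcast;

// Share a driver-side value with every task via a broadcast.
Broadcast<Float> threshold = javaSparkContext.broadcast(0.5f);

// Read it inside the closure of an RDD operation; only the small
// broadcast handle is serialized into each task, and Spark ships the
// value to every executor once.
JavaRDD<Float> aboveThreshold =
        dataAsRDD.filter(number -> number > threshold.value());

The key point is that the broadcast keeps the large value out of the serialized task itself, which is exactly what re-parallelizing broadcast.value() fails to do.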