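I have about 24 GB of records that I am reading from Cassandra and processing in Spark. I apply flatMapToPair and filter transformations, then store the resulting RDD using the DataStax Cassandra connector. But when the save-to-Cassandra action runs, my executors fail and Spark starts throwing the following exception -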
16/03/17 03:00:32 WARN TaskSetManager: Lost task 11.1 in stage 3.0 (TID 133, 10.0.0.65): FetchFailed(null, shuffleId=0, mapId=-1, reduceId=11, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:460)
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:456)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:456)
at org.apache.spark.MapOutputTracker.getMapSizesByExecutorId(MapOutputTracker.scala:183)
at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:47)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:90)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
On checking the executor logs, they show the following error -
ERROR MapOutputTracker: Missing an output location for shuffle 0
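Once it throws the exception, Spark restarts all the stages again. My cluster has 2 nodes with 16 GB of memory and 4 cores. I am running Spark as a standalone cluster with 12 GB allocated to each worker node. Also, when I run the job on my local machine with 10 GB of data, it works perfectly fine. I have tried changing the persistence level to DISK_ONLY and MEMORY_AND_DISK, but to no avail.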
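Edit 1 -
I have been able to figure out that the shuffle fails when I do the reduceByKey operation, because the aggregated records for a single key add up to 250 MB. Here is my reduceByKey snippet -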
JavaPairRDD<String, Products[]> categoryMapCollection = categoryMapFiltered.reduceByKey(
    new Function2<Products[], Products[], Products[]>() {
        @Override
        public Products[] call(Products[] p1, Products[] p2) {
            Products[] both = (Products[]) ArrayUtils.addAll(p1, p2);
            return both;
        }
    });
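To make clear what that merge function does on each call, here is a minimal plain-Java sketch with no Spark dependency. It assumes a placeholder `Products` class (the real one is not shown here) and non-null inputs. The names `MergeSketch` and `merge` are only for illustration. Like `ArrayUtils.addAll`, it allocates a new array whose length is the sum of both inputs and copies both into it, so the array for a key grows with every pair that is combined.

```java
import java.util.Arrays;

public class MergeSketch {
    // Hypothetical stand-in for the Products class used in the snippet above.
    static final class Products {
        final String name;
        Products(String name) { this.name = name; }
    }

    // Concatenates two non-null arrays into a newly allocated array,
    // which is what the reduceByKey function above does for each pair.
    static Products[] merge(Products[] p1, Products[] p2) {
        Products[] both = Arrays.copyOf(p1, p1.length + p2.length);
        System.arraycopy(p2, 0, both, p1.length, p2.length);
        return both;
    }

    public static void main(String[] args) {
        Products[] a = { new Products("a1"), new Products("a2") };
        Products[] b = { new Products("b1") };
        Products[] merged = merge(a, b);
        System.out.println(merged.length); // prints 3
    }
}
```

Applied repeatedly for one key, this is how the value for that key ends up as a single array of about 250 MB.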