I'm running Apache Zeppelin on Apache Mesos, with 4 nodes and 210 GB in total. My Spark job performs a join between a small transaction dataset and a large event dataset. I want to match each transaction with the nearest event, based on time and ID (event time against transaction time, ID against ID).
I get the following error:
FetchFailed(null, shuffleId=1, mapId=-1, reduceId=20,
message=org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 1
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:542)
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:538)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:538)
at org.apache.spark.MapOutputTracker.getMapSizesByExecutorId(MapOutputTracker.scala:155)
at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:47)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:140)
at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:136)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:136)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Here is my algorithm:
val groupRDD = event
  // Key each event by (id, truncated timestamp); dropRight(8) coarsens the
  // time so events and transactions in the same window share a key.
  .map { evt => ((evt.id, evt.date_time.toString.dropRight(8)), evt) }
  .groupByKey(new HashPartitioner(128))
  .persist(StorageLevel.MEMORY_AND_DISK_SER)

// Right outer join: keep every transaction, with an optional group of events.
val joinedRDD = groupRDD.rightOuterJoin(
  transactions.keyBy { transac => (transac.id, transac.dateTime.toString.dropRight(8)) })

val result = joinedRDD.mapValues { case (a, b) =>
  // Fall back to a placeholder event when nothing matched, then keep the
  // event closest in time to the transaction (minDelay picks the nearer one).
  val goodTransac = a.getOrElse(List(GeoLoc("", 0L, "", "", "", "", "")))
    .reduce((v1, v2) => minDelay(b.dateTime, v1, v2))
  SomeClass(b.id, b....., goodTransac.date_time, .....)
}
The groupByKey shouldn't be grouping too many elements (at most 50 per key).
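A quick way to verify that assumption, reusing the groupRDD defined above, would be to compute the size of the largest group and rule out key skew; this is just a diagnostic sketch, not part of the job:

// Note: this triggers the shuffle, so it is a (costly) diagnostic run.
val maxGroupSize = groupRDD.mapValues(_.size).values.max()
println(s"largest group: $maxGroupSize")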
I noticed that the error occurred when memory was running short, so I decided to persist with serialization to both RAM and disk, and I switched the serializer to Kryo. I also reduced spark.memory.storageFraction to 0.2 to leave more room for execution.
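For reference, this is roughly what those settings look like when set in code (in Zeppelin they would more likely go into the Spark interpreter settings); the property keys are standard Spark configuration names, the values mirror what I tried:

import org.apache.spark.SparkConf

// A sketch of the configuration described above, not the exact setup used.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.memory.storageFraction", "0.2") // shrink the storage share of the unified pool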
Checking the Web UI, I can see GC taking more and more time as processing goes on. When the job finally fails, GC accounts for 20 minutes out of a 22-minute runtime, although not on all workers.
I've already reviewed "Why do Spark jobs fail with org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0 in speculation mode?", but my cluster should still have enough memory: around 90 GB remains available to Mesos.
Answer 0 (score: 0):
The first thing I would do is check the number of partitions of the event RDD and of the RDD produced by groupByKey, using RDD.getNumPartitions.
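A minimal sketch of that check, reusing the RDD names from the question:

// Compare partition counts before and after the shuffle; groupRDD should
// report 128 because of the explicit HashPartitioner.
println(s"event partitions:    ${event.getNumPartitions}")
println(s"groupRDD partitions: ${groupRDD.getNumPartitions}")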
Using StorageLevel.MEMORY_AND_DISK_SER requires more I/O, which can slow the executors down, and the SER part can lead to longer GC pauses (after all, the dataset is in memory, and serializing it almost doubles the memory requirement). I'd strongly recommend not using MEMORY_AND_DISK_SER at this point.
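One hedged reading of that advice, assuming the groupRDD definition from the question, is to simply drop the _SER suffix while debugging (or to skip persist entirely, since groupRDD feeds only one join):

import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

// Same pipeline, but stored deserialized: no serialization round-trip on
// every access, at the price of a larger in-memory footprint.
val groupRDD = event
  .map { evt => ((evt.id, evt.date_time.toString.dropRight(8)), evt) }
  .groupByKey(new HashPartitioner(128))
  .persist(StorageLevel.MEMORY_AND_DISK)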
I would also review the dependency graph of the result RDD to see how many shuffles and partitions each stage uses, using result.toDebugString.
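A minimal sketch of that, assuming result as built in the question; in the printed lineage, each new indentation level roughly corresponds to a shuffle (stage) boundary:

// Print the RDD lineage; count the indentation levels to count the shuffles.
println(result.toDebugString)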
There are quite a few places where something could have gone wrong.
P.S. Screenshots of the Jobs, Stages, Storage and Executors pages from the Web UI would be very helpful in narrowing down the root cause.