Apache Spark 1.6.1 - OutOfMemory error when joining 2 large files

Posted: 2017-12-20 18:09:05

Tags: apache-spark, out-of-memory

We are using the Beam SDK to run an Apache Spark program. Internally, Beam calls the Spark framework for the join and GroupByKey. We get an OutOfMemory error while performing the join and group-by-key. Could you please advise us on how to resolve this problem?
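The pipeline code is not included in the question, but a two-way Beam join is typically written with CoGroupByKey, which the Spark runner translates into the Spark shuffle where the error below occurs. The following is only a minimal sketch of that pattern, assuming CSV-like inputs keyed by their first field; the class name, input paths, and field layout are hypothetical, and it uses the current Beam SDK API shape (older SDK releases that targeted Spark 1.6 differ slightly):

    // Minimal sketch only: JoinSketch, the input paths, and the "key = first
    // CSV field" layout are assumptions, not taken from the original job.
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.join.CoGbkResult;
    import org.apache.beam.sdk.transforms.join.CoGroupByKey;
    import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TupleTag;

    public class JoinSketch {

      // Keys each text line by its first comma-separated field.
      static class KeyByFirstField extends DoFn<String, KV<String, String>> {
        @ProcessElement
        public void processElement(ProcessContext c) {
          String line = c.element();
          int comma = line.indexOf(',');
          if (comma > 0) {
            c.output(KV.of(line.substring(0, comma), line.substring(comma + 1)));
          }
        }
      }

      public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        // Hypothetical paths; the real job joins a ~10 GB and a ~120 GB file.
        PCollection<KV<String, String>> small =
            p.apply("ReadSmall", TextIO.read().from("hdfs:///data/small/*"))
             .apply("KeySmall", ParDo.of(new KeyByFirstField()));
        PCollection<KV<String, String>> large =
            p.apply("ReadLarge", TextIO.read().from("hdfs:///data/large/*"))
             .apply("KeyLarge", ParDo.of(new KeyByFirstField()));

        final TupleTag<String> smallTag = new TupleTag<String>();
        final TupleTag<String> largeTag = new TupleTag<String>();

        // The join itself: CoGroupByKey groups both inputs by key and is what
        // the Spark runner executes as the shuffle seen in the stack trace.
        PCollection<KV<String, CoGbkResult>> joined =
            KeyedPCollectionTuple.of(smallTag, small)
                .and(largeTag, large)
                .apply(CoGroupByKey.create());

        joined
            .apply("FormatJoin", ParDo.of(new DoFn<KV<String, CoGbkResult>, String>() {
              @ProcessElement
              public void processElement(ProcessContext c) {
                // Emit the cross product of matching records per key; a skewed
                // key makes these iterables very large on a single task.
                for (String s : c.element().getValue().getAll(smallTag)) {
                  for (String l : c.element().getValue().getAll(largeTag)) {
                    c.output(c.element().getKey() + "," + s + "," + l);
                  }
                }
              }
            }))
            .apply("WriteJoined", TextIO.write().to("hdfs:///data/joined/part"));

        p.run().waitUntilFinish();
      }
    }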

Problem: OutOfMemory error

Use case

1) Joining a 10 GB file with a 120 GB file

Our cluster configuration has limitations:

8 nodes, and the maximum executor memory we can set per node is 7 GB. Each node has a maximum capacity of 80 GB. Total cores: 192. (A rough capacity check against these limits is sketched after the submit command below.)

2) Spark Submit command used:

./spark-submit \
  --class com.service.Employee \
  --master yarn-cluster \
  --driver-memory 5G \
  --executor-memory 7G \
  --executor-cores 3 \
  --num-executors 50 \
  --conf spark.yarn.executor.memoryOverhead=1024 \
  --conf spark.sql.shuffle.partitions=300 \
  --conf spark.default.parallelism=300 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  /hadoop/project.jar
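For orientation, here is a back-of-the-envelope check of how the requested resources map onto the limits above. It is my own sketch: it reuses only the numbers quoted in the question, while the assumption that YARN packs executors by both memory and cores (and the class name ResourceCheck) is hypothetical rather than something from the job.

    // Rough capacity check using only the numbers from the question; how YARN
    // actually rounds container sizes and reserves node memory is simplified.
    public class ResourceCheck {
      public static void main(String[] args) {
        int nodes = 8;
        int nodeMemoryGb = 80;        // stated maximum capacity per node
        int totalCores = 192;         // i.e. 24 cores per node
        int executorMemoryGb = 7;     // --executor-memory 7G
        int overheadGb = 1;           // spark.yarn.executor.memoryOverhead=1024 (MB)
        int executorCores = 3;        // --executor-cores 3
        int shufflePartitions = 300;  // spark.sql.shuffle.partitions=300
        double inputGb = 120 + 10;    // the two files being joined

        int perExecutorGb = executorMemoryGb + overheadGb;      // 8 GB per executor container
        int byMemory = nodeMemoryGb / perExecutorGb;            // 10 executors per node by memory
        int byCores = (totalCores / nodes) / executorCores;     // 8 executors per node by cores
        int maxExecutors = Math.min(byMemory, byCores) * nodes; // 64, so the 50 requested do fit

        double inputPerTaskGb = inputGb / shufflePartitions;    // ~0.43 GB of raw input per shuffle partition
        double heapPerTaskGb = (double) executorMemoryGb / executorCores; // ~2.3 GB of heap per concurrent task, at best

        System.out.printf("executors that fit on the cluster: %d%n", maxExecutors);
        System.out.printf("raw input per shuffle partition: %.2f GB%n", inputPerTaskGb);
        System.out.printf("heap per concurrent task (upper bound): %.2f GB%n", heapPerTaskGb);
        // Deserialized objects are much larger than their on-disk bytes, and a
        // skewed join key concentrates far more than the average 0.43 GB in a
        // single task; either effect can exhaust the ~2.3 GB per-task share and
        // produce a GC-overhead error like the one shown below.
      }
    }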

3) java.lang.OutOfMemoryError: GC overhead limit exceeded

    at java.lang.reflect.Array.newInstance(Array.java:75)
    at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1897)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1529)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
    at org.apache.spark.serializer.DeserializationStream.readValue(Serializer.scala:171)
    at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:201)
    at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:198)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:152)
    at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:45)
    at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:89)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

0 Answers:

No answers