All executors die shortly after startup

Time: 2020-05-22 05:59:11

Tags: apache-spark

I set up an EMR Spark cluster with 31 m4.4xlarge machines (1 master + 30 nodes; 16 cores and 64 GB of memory each). When I start spark-shell, I notice that all executors die within a short time (about 10 seconds), with "GC (Allocation Failure)" in their logs. Searching online suggests this happens when more resources are requested than a machine has, but that should not be my case, since I request only 12 cores + 20 GB of memory per machine (see the arithmetic sketch after the command below).

My environment: Spark version 2.4.2, Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_201).

The command I use:

spark-shell --deploy-mode client --master yarn \
--conf spark.driver.maxResultSize=0 \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.kryoserializer.buffer.max=1g \
--num-executors 15 --executor-cores 12 \
--driver-memory 10G --executor-memory 10G \
--conf spark.driver.memoryOverhead=10G --conf spark.executor.memoryOverhead=10G
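
For reference, here is the container-size arithmetic these flags imply, as a minimal sketch that can be pasted into any Scala REPL. The 57344 MB NodeManager budget is my assumption based on typical EMR defaults for m4.4xlarge; everything else comes from the flags above.

// Sketch of the YARN container arithmetic implied by the flags above.
// ASSUMPTION: yarn.nodemanager.resource.memory-mb is about 57344 MB,
// a typical EMR default for m4.4xlarge -- check the real value on the cluster.
val executorHeapMb = 10 * 1024                   // --executor-memory 10G
val overheadMb     = 10 * 1024                   // spark.executor.memoryOverhead=10G
val containerMb    = executorHeapMb + overheadMb // each executor asks YARN for 20480 MB
val nodeBudgetMb   = 57344                       // assumed NodeManager budget per node
val perNode        = nodeBudgetMb / containerMb  // 2 such containers fit per node

With 15 executors spread across 30 nodes, the request is well under the per-node budget, which is why I believe over-allocation is not the problem.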

The error from the dying executors:

Log Type: stdout

Log Upload Time: Fri May 22 05:19:26 +0000 2020

Log Length: 2759

2020-05-22T05:16:56.702+0000: [GC (Allocation Failure) 2020-05-22T05:16:56.702+0000: [ParNew: 275328K->13813K(309696K), 0.0131323 secs] 275328K->13813K(997824K), 0.0132452 secs] [Times: user=0.06 sys=0.03, real=0.02 secs] 
2020-05-22T05:16:57.549+0000: [GC (Allocation Failure) 2020-05-22T05:16:57.549+0000: [ParNew: 289141K->32049K(309696K), 0.0207406 secs] 289141K->48435K(997824K), 0.0208144 secs] [Times: user=0.16 sys=0.04, real=0.02 secs] 
2020-05-22T05:16:57.570+0000: [GC (CMS Initial Mark) [1 CMS-initial-mark: 16386K(688128K)] 52997K(997824K), 0.0009641 secs] [Times: user=0.01 sys=0.00, real=0.00 secs] 
2020-05-22T05:16:57.571+0000: [CMS-concurrent-mark-start]
2020-05-22T05:16:57.572+0000: [CMS-concurrent-mark: 0.002/0.002 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 
2020-05-22T05:16:57.572+0000: [CMS-concurrent-preclean-start]
2020-05-22T05:16:57.574+0000: [CMS-concurrent-preclean: 0.002/0.002 secs] [Times: user=0.01 sys=0.00, real=0.00 secs] 
2020-05-22T05:16:57.574+0000: [CMS-concurrent-abortable-preclean-start]
 CMS: abort preclean due to time 2020-05-22T05:17:02.638+0000: [CMS-concurrent-abortable-preclean: 1.463/5.064 secs] [Times: user=2.32 sys=0.03, real=5.06 secs] 
2020-05-22T05:17:02.639+0000: [GC (CMS Final Remark) [YG occupancy: 136011 K (309696 K)]2020-05-22T05:17:02.639+0000: [Rescan (parallel) , 0.0033639 secs]2020-05-22T05:17:02.642+0000: [weak refs processing, 0.0000326 secs]2020-05-22T05:17:02.642+0000: [class unloading, 0.0032369 secs]2020-05-22T05:17:02.645+0000: [scrub symbol table, 0.0035727 secs]2020-05-22T05:17:02.649+0000: [scrub string table, 0.0002802 secs][1 CMS-remark: 16386K(688128K)] 152397K(997824K), 0.0112250 secs] [Times: user=0.04 sys=0.00, real=0.01 secs] 
2020-05-22T05:17:02.650+0000: [CMS-concurrent-sweep-start]
2020-05-22T05:17:02.650+0000: [CMS-concurrent-sweep: 0.000/0.000 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 
2020-05-22T05:17:02.650+0000: [CMS-concurrent-reset-start]
2020-05-22T05:17:02.729+0000: [CMS-concurrent-reset: 0.079/0.079 secs] [Times: user=0.02 sys=0.06, real=0.08 secs] 
Heap
 par new generation   total 309696K, used 148816K [0x0000000540000000, 0x0000000555000000, 0x0000000583990000)
  eden space 275328K,  42% used [0x0000000540000000, 0x0000000547207940, 0x0000000550ce0000)
  from space 34368K,  93% used [0x0000000550ce0000, 0x0000000552c2c6f8, 0x0000000552e70000)
  to   space 34368K,   0% used [0x0000000552e70000, 0x0000000552e70000, 0x0000000555000000)
 concurrent mark-sweep generation total 688128K, used 16386K [0x0000000583990000, 0x00000005ad990000, 0x00000007c0000000)
 Metaspace       used 27490K, capacity 27784K, committed 28064K, reserved 1073152K
  class space    used 3669K, capacity 3778K, committed 3884K, reserved 1048576K
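
Decoding the hex address ranges in that heap summary (plain arithmetic over values copied from the log, not Spark API) confirms that the reserved heap matches the 10G executor heap setting:

// Address ranges copied verbatim from the heap summary above;
// reserved size = (end of reserved range) - (start of range).
val youngReservedMb = (0x0000000583990000L - 0x0000000540000000L) / (1024 * 1024) // ~1081 MB
val oldReservedMb   = (0x00000007c0000000L - 0x0000000583990000L) / (1024 * 1024) // ~9158 MB
// Together roughly 10 GiB, consistent with --executor-memory 10G.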

Spark UI: [screenshot]

Currently I can only launch it successfully without the driver/executor memory settings, but then the executors and the spark-shell die as soon as I run a job that is not even large:

$ spark-shell --deploy-mode client --master yarn \
    --num-executors 15 --executor-cores 12 \
    --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
    --conf spark.kryoserializer.buffer.max=1g
....
scala> val validationDataPath: String = "/data/model/df_valid.parquet"
scala> val validation_df = spark.read.parquet(validationDataPath)
validation_df: org.apache.spark.sql.DataFrame = [features: vector, click: bigint ... 5 more fields]
scala> validation_df.count
res0: Long = 4816120

scala> validation_df.rdd.first
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 63635"...
/usr/lib/spark/bin/spark-shell: line 47: 63635 Killed                  "${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell" "$@"
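
For comparison, here is a DataFrame-level equivalent of the failing call, as a sketch only (whether it avoids the OOM on this cluster is untested). Calling `.rdd` forces every column, including the vector column, to be deserialized into JVM objects, while the Dataset API keeps rows in Spark's internal binary format until the final collect.

// Same data as in the session above; path and schema come from there.
val validationDataPath: String = "/data/model/df_valid.parquet"
val validation_df = spark.read.parquet(validationDataPath)

val firstRow = validation_df.head()     // one Row collected to the driver
validation_df.show(1, truncate = false) // or: inspect without keeping Row objects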

0 Answers:

No answers yet.