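I am running a spark-submit application on an AWS EMR cluster (EMR 5.0.0, Spark 2.0.0, 30 r3.4xlarge instances). To launch the script, I SSH into the master node and run the spark-submit command shown below.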
The application uses the default AWS Spark configuration, which sets spark.master = yarn and deploy-mode = client.
time spark-submit --conf spark.sql.shuffle.partitions=5000 \
--conf spark.memory.storageFraction=0.3 --conf spark.memory.fraction=0.95 \
--executor-memory 8G --driver-memory 10G dataframe_script.py
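The command above relies on the EMR defaults for the master and deploy mode. As a minimal sketch, the same invocation can be assembled with those two settings spelled out through the standard `--master` and `--deploy-mode` flags. The helper name `build_spark_submit` is hypothetical and exists only for illustration; the option values are copied from the command above.

```python
import shlex


def build_spark_submit(script, master="yarn", deploy_mode="client"):
    """Return the spark-submit argument list with master and deploy mode explicit.

    Hypothetical helper for illustration; the values mirror the command in the question.
    """
    conf = {
        "spark.sql.shuffle.partitions": "5000",
        "spark.memory.storageFraction": "0.3",
        "spark.memory.fraction": "0.95",
    }
    argv = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    for key, value in conf.items():
        # Each Spark property is passed as its own --conf key=value pair.
        argv += ["--conf", f"{key}={value}"]
    argv += ["--executor-memory", "8G", "--driver-memory", "10G", script]
    return argv


if __name__ == "__main__":
    # Print a shell-safe version of the command instead of running it.
    print(" ".join(shlex.quote(arg) for arg in build_spark_submit("dataframe_script.py")))
```

Printing the command rather than executing it keeps the sketch safe to run on a machine without Spark installed.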
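After the application finishes writing, it does not return to the command line for more than 10 minutes and emits the following warnings: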
ERROR LiveListenerBus: Dropping SparkListenerEvent because no remaining room in event queue. This likely means one of the SparkListeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler.
WARN ExecutorAllocationManager: No stages are running, but numRunningTasks != 0
16/10/12 00:40:03 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(176,WrappedArray())
[Stage 17:=================================================> (465 + 35) / 500]
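To make the output above easier to read, here is a rough sketch that tallies the messages by log level and component and decodes the console progress bar. It assumes the bar follows the `(completed + running) / total` layout, which matches how it appears here; the function names `parse_progress` and `count_by_source` are made up for this example.

```python
import re
from collections import Counter

# Assumed layout of the console progress bar: [Stage N:===> (completed + running) / total]
PROGRESS = re.compile(r"\[Stage (\d+):.*\((\d+) \+ (\d+)\) / (\d+)\]")
# Log level followed by the component name, e.g. "ERROR LiveListenerBus:"
LEVEL = re.compile(r"\b(ERROR|WARN|INFO)\s+(\w+):")


def parse_progress(line):
    """Return task counts from a progress-bar line, or None if the line does not match."""
    match = PROGRESS.search(line)
    if match is None:
        return None
    stage, completed, running, total = map(int, match.groups())
    return {
        "stage": stage,
        "completed": completed,
        "running": running,
        "total": total,
        "pending": total - completed - running,
    }


def count_by_source(lines):
    """Count log lines per (level, component) pair, skipping lines without a level."""
    counts = Counter()
    for line in lines:
        match = LEVEL.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts
```

Applied to the lines above, this shows stage 17 with 465 tasks completed and 35 still running out of 500, while the errors all come from `LiveListenerBus`.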
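There is a previous StackOverflow question that refers to this JIRA. It looks like the issue was fixed for older versions of Spark, but I do not quite understand what the problem was.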
Answer 0 (score: 0)