Question

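I'm trying to understand the speed difference between my spark-submit and spark shell jobs. I launch the shell or the submit with the same resource allocation, but performance seems wildly different. The job takes about 10 minutes when I run it in the shell, versus more than an hour with spark-submit. So my question is: is the number of tasks shown in the REPL's progress bar the same thing as the number of executors running under spark-submit? The numbers I see for each are very different, and I'm wondering whether I'm doing something wrong.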

I start the shell with:

    --executor-cores 5 \
    --executor-memory 16g \
    --driver-memory 230g \
    --conf "spark.driver.maxResultSize=100g" \
    --conf "spark.network.timeout=360s"

and I see 950 concurrent tasks:

... pandas_df = intent_dict_rdd.map(lambda x: Row(**x)).toDF().toPandas()
[Stage 1:==============================>                  (19503 + 950) / 31641]

When I spark-submit with the same resource allocation, I only see 189 executors:

18/07/19 23:44:25 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20180719234425-0001/189 on worker-20180719233757-10.0.108.198-33953 (10.0.108.198:33953) with 5 cores
18/07/19 23:44:25 INFO StandaloneSchedulerBackend: Granted executor ID app-20180719234425-0001/189 on hostPort 10.0.108.198:33953 with 5 cores, 16.0 GB RAM

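I'm using 10 m5.24xlarge machines, so each machine has 96 cores and 384 GB of memory. That is 960 cores in total, which looks close to the number of tasks I'm seeing. The number of executors looks more like 960 cores divided by 5 cores per executor. Am I focusing on the wrong thing? Is there some other explanation for spark-submit performing so much worse than the spark shell?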
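A rough way to reconcile the two numbers, written as plain Python arithmetic. This is only a sketch under assumptions that are not confirmed anywhere in the question: that each worker can host only whole 5-core executors, that standalone executor IDs are numbered from 0, and that each task occupies one core (the default `spark.task.cpus=1`). The variable names are made up for illustration.

```python
# Figures quoted in the question.
machines = 10
cores_per_machine = 96      # m5.24xlarge
executor_cores = 5          # --executor-cores 5

total_cores = machines * cores_per_machine  # 960

# Assumption: a worker only hosts whole executors, so 96 cores fit
# 19 executors of 5 cores each and 1 core per machine stays unused.
executors_per_machine = cores_per_machine // executor_cores
executors = machines * executors_per_machine

# Assumption: one core per task (spark.task.cpus left at its default of 1).
task_slots = executors * executor_cores

# Assumption: executor IDs start at 0, so the last ID is count - 1.
highest_executor_id = executors - 1

print(total_cores, executors, task_slots, highest_executor_id)
```

If those assumptions hold, executor ID 189 in the submit log and 950 running tasks in the shell would describe the same amount of parallelism, so the slowdown may come from something other than the executor count.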

Spark Submit Executors == Spark Shell Tasks?

0 answers: