I have some questions about the values for Spark executors, the driver, executor cores, and executor memory.
Answer 0 (score: 0)
If no application is running on the cluster and you submit a job, what are the default values for Spark executors, executor cores, and executor memory?
The default values are stored in spark-defaults.conf on the cluster where Spark is installed, so you can verify them there. To check the defaults, refer to this document.
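If you want to confirm what a submitted application actually ended up with, you can also read the settings back from the running SparkContext. Below is a minimal Scala sketch; the keys are the standard ones (spark.executor.instances, spark.executor.cores, spark.executor.memory, spark.driver.memory), and any key that is unset simply falls back to Spark's built-in default.

    import org.apache.spark.sql.SparkSession

    object ShowExecutorSettings {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("show-executor-settings")
          .getOrCreate()

        val conf = spark.sparkContext.getConf
        // getOption returns None when a key is not set anywhere
        // (spark-submit flags, spark-defaults.conf, or code),
        // meaning Spark's built-in default applies.
        Seq("spark.executor.instances",
            "spark.executor.cores",
            "spark.executor.memory",
            "spark.driver.memory").foreach { key =>
          println(s"$key = ${conf.getOption(key).getOrElse("<not set, built-in default>")}")
        }

        spark.stop()
      }
    }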
If we want to calculate the number of Spark executors, executor cores, and executor memory needed for the job you are going to submit, how would you do it?
It depends on the following things:
What kind of job it is, i.e., whether it is shuffle-intensive or map-only. If it is shuffle-heavy, you may need more memory.
The data size: the bigger the data, the higher the memory usage.
Cluster constraints: how much memory you can afford.
Based on these factors, start with some numbers, then look at the Spark UI to understand the bottlenecks and increase or decrease the memory footprint accordingly.
One caveat: keeping executor memory above 40 GB can be counterproductive, because JVM GC becomes slower. Too many cores can also slow the process down.
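As a concrete way to "start with some numbers", you can set explicit values when building the session and then iterate after watching the Spark UI. The values below are placeholder starting points only, not recommendations:

    import org.apache.spark.sql.SparkSession

    // Illustrative starting values; adjust after checking the Spark UI
    // (task time, GC time, shuffle spill) for your own job and data size.
    val spark = SparkSession.builder()
      .appName("tuning-starting-point")
      .config("spark.executor.instances", "4")  // grow if stages have many queued tasks
      .config("spark.executor.cores", "4")      // keep modest; too many cores slows tasks down
      .config("spark.executor.memory", "8g")    // raise if you see spills or OOM errors
      .getOrCreate()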
Answer 1 (score: 0)
Avishek's answer covers the defaults, so I will focus on how to calculate the optimal values. Let's take an example:
Example: 6 nodes, each with 16 cores and 64 GB RAM
Each executor is a JVM instance, so multiple executors can run on a single node.
Let's start by choosing the number of cores per executor:
Number of cores = the number of concurrent tasks an executor can run.
One might think that higher concurrency means better performance. However, experiments have shown that Spark jobs perform well when the number of cores per executor = 5.
If the number of cores > 5, it leads to poor performance.
Note that 1 core and 1 GB of RAM per node are needed for the OS and Hadoop daemons.
Now, calculate the number of executors:
As discussed above, 15 cores are available on each node and we are planning for 5 cores per executor.
Thus number of executors per node = 15/5 = 3
Total number of executors = 3*6 = 18
Out of all the executors, 1 is needed for the YARN ApplicationMaster (AM).
Thus, the final executor count = 18 - 1 = 17 executors.
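The same arithmetic, written out as a small Scala snippet (the inputs are simply this example's assumptions):

    val nodes            = 6
    val usableCores      = 16 - 1                          // 1 core reserved for OS / Hadoop daemons
    val coresPerExecutor = 5                                // the sweet spot discussed above
    val executorsPerNode = usableCores / coresPerExecutor   // 15 / 5 = 3
    val totalExecutors   = executorsPerNode * nodes         // 3 * 6 = 18
    val finalExecutors   = totalExecutors - 1               // 17, leaving one slot for the YARN AM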
Memory per executor:
Executors per node = 3
RAM available per node = 63 GB (as 1 GB is needed for the OS and Hadoop daemons)
Memory per executor = 63 / 3 = 21 GB
Spark also needs some memory overhead, which is max(384 MB, 7% of the memory per executor).
Thus, 7% of 21 GB = 1.47 GB.
As 1.47 GB > 384 MB, subtract 1.47 from 21.
Hence, 21 - 1.47 ≈ 19 GB.
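And the memory side of the calculation as a sketch, using the same assumed inputs:

    val usableRamGb      = 64 - 1                                     // 63 GB after OS / Hadoop daemons
    val executorsPerNode = 3
    val rawMemoryGb      = usableRamGb.toDouble / executorsPerNode    // 21 GB per executor
    val overheadGb       = math.max(0.384, 0.07 * rawMemoryGb)        // max(384 MB, 7%) = 1.47 GB
    val executorMemoryGb = (rawMemoryGb - overheadGb).toInt           // ~19 GB after rounding down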
Final numbers:
Executors - 17, Cores - 5, Executor memory - 19 GB
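Assuming YARN, these numbers would typically be passed to spark-submit as --num-executors 17 --executor-cores 5 --executor-memory 19g, or set in code before the context is created, for example:

    import org.apache.spark.sql.SparkSession

    // Applying the numbers derived above; only effective if set before
    // the SparkContext is created (e.g. not in an already-running shell).
    val spark = SparkSession.builder()
      .appName("sized-job")
      .config("spark.executor.instances", "17")
      .config("spark.executor.cores", "5")
      .config("spark.executor.memory", "19g")
      .getOrCreate()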
Note:
1. Sometimes you may want to allocate less than 19 GB of memory. As memory per executor decreases, the number of executors increases and the number of cores per executor decreases. As discussed above, 5 cores per executor is the best value; reducing it will still give good results, just don't go above 5.
2. Memory per executor should stay below 40 GB, otherwise there will be considerable GC overhead.