Question

My input dataset is about 150G. I am setting

--conf spark.cores.max=100 
--conf spark.executor.instances=20 
--conf spark.executor.memory=8G 
--conf spark.executor.cores=5 
--conf spark.driver.memory=4G

but because the data is not evenly distributed across the executors, I keep getting

Container killed by YARN for exceeding memory limits. 9.0 GB of 9 GB physical memory used

Here are my questions:

1. Did I not allocate enough memory in the first place? I think 20 * 8G > 150G, but it's hard to get a perfect distribution, so some executors will suffer.
2. I'm thinking about repartitioning the input DataFrame, so how can I determine how many partitions to set? Is higher always better?
3. The error says "9 GB physical memory used", but I only set 8G for executor memory. Where does the extra 1G come from?

Thanks!

Answer 1

The 9GB is made up of the 8GB of executor memory you passed as a parameter plus spark.yarn.executor.memoryOverhead, which defaults to .1 of the executor memory. The total memory of the container is therefore spark.executor.memory + (.1 * spark.executor.memory), i.e. 8GB + (.1 * 8GB) ≈ 9GB.
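The arithmetic above can be sketched in a few lines of Python. This is only an illustration, not Spark's own code: the function name is made up, and the 384 MB floor reflects my understanding of the default overhead rule (the larger of 384 MB and 10% of executor memory).

```python
def container_memory_mb(executor_memory_mb, overhead_factor=0.10, min_overhead_mb=384):
    """Memory requested from YARN per executor: heap plus overhead, in MB.

    Hypothetical helper; the 10% factor comes from the answer above and the
    384 MB floor is an assumption about Spark's default.
    """
    overhead_mb = max(min_overhead_mb, int(executor_memory_mb * overhead_factor))
    return executor_memory_mb + overhead_mb


# 8 GB executor: 8192 MB heap + 819 MB overhead
requested = container_memory_mb(8 * 1024)
print(requested, "MB")  # 9011 MB, roughly 8.8 GB
```

So the request is about 8.8 GB, which YARN then reports as a 9 GB container.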

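You could run the entire process with a single executor, but it would take ages. To understand this you need to know the notions of partitions and tasks. The number of partitions is determined by your input and your operations. For example, if you read a 150GB CSV from HDFS and the HDFS block size is 128MB, you end up with 150 * 1024 / 128 = 1200 partitions, which map directly to 1200 tasks in the Spark UI.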
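As a quick check of that estimate, here is a small Python sketch. It assumes one partition per HDFS block and a splittable input, which is a simplification; the function name is hypothetical.

```python
import math


def input_partitions(input_size_mb, block_size_mb=128):
    """Estimate input partitions, assuming one partition per HDFS block."""
    return math.ceil(input_size_mb / block_size_mb)


partitions = input_partitions(150 * 1024)
print(partitions)  # 1200
```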

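Every task is picked up by an executor. You never need to hold all 150GB in memory at once. For example, with a single executor you obviously won't benefit from Spark's parallelism, but it will simply start with the first task, process the data, save it back to the DFS, and then move on to the next task.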
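To put rough numbers on this for the settings in the question (20 executors with 5 cores each), the sketch below estimates how many waves of tasks run and how much raw input is being read at any moment. It assumes one task per core and 128MB input partitions; the in-memory footprint of a partition can be considerably larger than its on-disk size, so treat the second figure as a lower bound. The function names are made up for illustration.

```python
import math


def task_waves(num_tasks, executors, cores_per_executor):
    """Number of rounds needed when each core runs one task at a time."""
    slots = executors * cores_per_executor
    return math.ceil(num_tasks / slots)


def input_in_flight_mb(executors, cores_per_executor, partition_mb):
    """Raw input being processed concurrently across the cluster, in MB."""
    return executors * cores_per_executor * partition_mb


print(task_waves(1200, 20, 5))          # 12
print(input_in_flight_mb(20, 5, 128))   # 12800 (about 12.5 GB, not 150 GB)
```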

What you should check:

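How big are the input partitions? Is the input file splittable at all? If a single executor has to load a massive amount of data into memory, it will certainly run out of memory.
What kind of operations are you performing? For example, if you join on a key with very low cardinality, you end up with massive partitions, because all rows with a specific value end up in the same partition.
Are you performing any very expensive or inefficient operations? Any Cartesian products, etc.?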
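One rough way to gauge the skew mentioned in the second point is to look at how the rows are distributed over the join key. The plain-Python sketch below works on a small sample of key values; the sample data and the function name are hypothetical, and on a real dataset you would compute the same counts with a group-by in Spark.

```python
from collections import Counter


def largest_key_share(keys):
    """Return the most frequent key and the fraction of rows it accounts for."""
    counts = Counter(keys)
    key, rows = counts.most_common(1)[0]
    return key, rows / len(keys)


# Hypothetical sample of join-key values: one value dominates.
sample_keys = ["US"] * 90 + ["DE"] * 6 + ["FR"] * 4
key, share = largest_key_share(sample_keys)
print(key, share)  # US 0.9
```

If a single key holds most of the rows, the task that receives it will need far more memory than the others, no matter how many partitions you ask for.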

Hope this helps. Happy sparking!

Answer 2

When using YARN, there is another setting that figures into how big a YARN container to request for your executors:

spark.yarn.executor.memoryOverhead

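It defaults to 0.1 * your executor memory setting. It defines how much extra overhead memory to request on top of what you specify as executor memory. Try increasing this number first.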

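Also, YARN won't give you a container of arbitrary size. It only returns containers whose memory size is a multiple of its minimum allocation size, which is controlled by this setting: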

yarn.scheduler.minimum-allocation-mb

Setting it to a smaller number reduces the risk of overshooting the amount you asked for.
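The rounding can be illustrated with a short Python sketch. It is a simplified model, not YARN's scheduler code: it assumes requests are rounded up to the next multiple of yarn.scheduler.minimum-allocation-mb, and the 9011 MB request is my assumption for an 8 GB executor plus roughly 10% overhead.

```python
import math


def yarn_allocation_mb(requested_mb, min_allocation_mb=1024):
    """Round a memory request up to a multiple of the minimum allocation size."""
    return math.ceil(requested_mb / min_allocation_mb) * min_allocation_mb


# Assumed request: 8192 MB executor memory + 819 MB overhead
request_mb = 9011
print(yarn_allocation_mb(request_mb, 1024))  # 9216  (9 GB)
print(yarn_allocation_mb(request_mb, 4096))  # 12288 (12 GB, a large overshoot)
```

With a large minimum allocation the container can end up much bigger than what was requested, which is the overshoot described above.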

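I also typically set the key below to a value larger than my desired container size, so that the Spark request controls how big my executors are rather than YARN stomping on them. It caps how much memory YARN can hand out to containers on a node.

yarn.nodemanager.resource.memory-mb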

Spark: executor memory exceeds physical limit
