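In Spark 2.0 I have two DataFrames. I need to join them first and then do a reduceByKey to aggregate the data. I always get an OOM in the executors. Thanks in advance.
d1 (1G, 500 million rows, cached, partitioned by column id2)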
id1 id2
1 1
1 3
1 4
2 0
2 7
...
d2 (160G, 2 million rows, cached, partitioned by column id2; the value column contains a list of 5000 floats)
id2 value
0 [0.1, 0.2, 0.0001, ...]
1 [0.001, 0.7, 0.0002, ...]
...
Now I need to join the two tables to get d3, and I use spark.sql:
select d1.id1, d2.value
FROM d1 JOIN d2 ON d1.id2 = d2.id2
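Then I do a reduceByKey on d3 to aggregate the values for each id1 in table d1:
import numpy

# asarray covers keys with a single row, where x is still a plain list
d4 = d3.rdd.reduceByKey(lambda x, y: numpy.add(x, y)) \
    .mapValues(lambda x: (numpy.asarray(x) / numpy.linalg.norm(x, 1)).tolist()) \
    .toDF()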
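To make the aggregation concrete, here is a minimal plain-Python sketch of what the reduceByKey plus L1 normalization computes. It is not Spark code: `reduce_by_key_and_normalize` and the sample rows are made up for illustration, and the vectors have 2 entries instead of 5000.

```python
def reduce_by_key_and_normalize(pairs):
    """Sum the vectors that share a key, then L1-normalize each sum."""
    sums = {}
    for key, vec in pairs:
        if key in sums:
            # Element-wise addition, the same thing numpy.add does per pair.
            sums[key] = [a + b for a, b in zip(sums[key], vec)]
        else:
            sums[key] = list(vec)
    result = {}
    for key, total in sums.items():
        norm = sum(abs(v) for v in total)  # L1 norm of the summed vector
        result[key] = [v / norm for v in total] if norm else total
    return result


# Hypothetical (id1, value) rows standing in for d3.
d3_rows = [(1, [0.5, 1.5]), (1, [1.0, 1.0]), (2, [2.0, 6.0])]
d4 = reduce_by_key_and_normalize(d3_rows)
print(d4)
```

Each output vector sums to 1 in absolute value, and there is one output row per distinct id1.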
I estimate the size of d4 to be 340G. Right now I am running the job on an r3.8xlarge machine:
mem: 244G
cpu: 64
Disk: 640G
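I played with several configurations, but I always get an OOM in the executors. So my questions are:
Is it possible to run this job on the current type of machine? Or should I just use a bigger machine (how big?). I remember coming across articles/blogs that describe terabyte-scale processing on relatively small machines.
What kind of improvements should I make? E.g. Spark configuration, code optimization?
Is it possible to estimate the amount of memory needed for each executor?
Some of the Spark configurations I tried: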
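As a rough starting point for the memory question, the raw payload sizes can be worked out from the numbers above. This is a back-of-envelope sketch only. It assumes 8-byte floats with no per-object or serialization overhead, and it assumes every row of d1 matches exactly one row of d2; neither assumption is stated in the question.

```python
ROWS_D1 = 500_000_000      # rows in d1, from the question
ROWS_D2 = 2_000_000        # rows in d2, from the question
FLOATS_PER_VECTOR = 5000   # length of each value list, from the question
BYTES_PER_FLOAT = 8        # assumption: 64-bit floats, no overhead


def raw_vector_bytes(rows):
    """Bytes needed just for the float payload of `rows` vectors."""
    return rows * FLOATS_PER_VECTOR * BYTES_PER_FLOAT


d2_raw_gb = raw_vector_bytes(ROWS_D2) / 1e9
# Assumption: the join emits one vector per d1 row.
d3_raw_tb = raw_vector_bytes(ROWS_D1) / 1e12

print(f"d2 payload: {d2_raw_gb:.0f} GB")
print(f"d3 payload if fully materialized: {d3_raw_tb:.0f} TB")
```

Under these assumptions the join output is far larger than either input, so how much of d3 has to be held in memory at once matters much more than the size of d1 or d2.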
CONFIG1:
--verbose
--conf spark.sql.shuffle.partitions=200
--conf spark.dynamicAllocation.enabled=false
--conf spark.driver.maxResultSize=24G
--conf spark.shuffle.blockTransferService=nio
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
--conf spark.kryoserializer.buffer.max=2000M
--conf spark.rpc.message.maxSize=800
--conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:MetaspaceSize=100M"
--num-executors 4
--executor-memory 48G
--executor-cores 15
--driver-memory 24G
--driver-cores 3
CONFIG2:
--verbose
--conf spark.sql.shuffle.partitions=10000
--conf spark.dynamicAllocation.enabled=false
--conf spark.driver.maxResultSize=24G
--conf spark.shuffle.blockTransferService=nio
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
--conf spark.kryoserializer.buffer.max=2000M
--conf spark.rpc.message.maxSize=800
--conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:MetaspaceSize=100M"
--num-executors 4
--executor-memory 48G
--executor-cores 15
--driver-memory 24G
--driver-cores 3
CONFIG3:
--verbose
--conf spark.sql.shuffle.partitions=10000
--conf spark.dynamicAllocation.enabled=true
--conf spark.driver.maxResultSize=6G
--conf spark.shuffle.blockTransferService=nio
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
--conf spark.kryoserializer.buffer.max=2000M
--conf spark.rpc.message.maxSize=800
--conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:MetaspaceSize=100M"
--executor-memory 6G
--executor-cores 2
--driver-memory 6G
--driver-cores 3
CONFIG4:
--verbose
--conf spark.sql.shuffle.partitions=20000
--conf spark.dynamicAllocation.enabled=false
--conf spark.driver.maxResultSize=6G
--conf spark.shuffle.blockTransferService=nio
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
--conf spark.kryoserializer.buffer.max=2000M
--conf spark.rpc.message.maxSize=800
--conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:MetaspaceSize=100M"
--num-executors 13
--executor-memory 15G
--executor-cores 5
--driver-memory 13G
--driver-cores 5
OOM error 1 from the executor
ExecutorLostFailure (executor 14 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 9.1 GB of 9 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
Heap
PSYoungGen total 1830400K, used 1401721K [0x0000000740000000, 0x00000007be900000, 0x00000007c0000000)
eden space 1588736K, 84% used [0x0000000740000000,0x0000000791e86980,0x00000007a0f80000)
from space 241664K, 24% used [0x00000007af600000,0x00000007b3057de8,0x00000007be200000)
to space 236032K, 0% used [0x00000007a0f80000,0x00000007a0f80000,0x00000007af600000)
ParOldGen total 4194304K, used 4075884K [0x0000000640000000, 0x0000000740000000, 0x0000000740000000)
object space 4194304K, 97% used [0x0000000640000000,0x0000000738c5b198,0x0000000740000000)
Metaspace used 59721K, capacity 60782K, committed 61056K, reserved 1101824K
class space used 7421K, capacity 7742K, committed 7808K, reserved 1048576K
OOM error 2 from the executor
ExecutorLostFailure (executor 7 exited caused by one of the running tasks) Reason: Container marked as failed: container_1477662810360_0002_01_000008 on host: ip-172-18-9-130.ec2.internal. Exit status: 52. Diagnostics: Exception from container-launch.
Heap
PSYoungGen total 1968128K, used 1900544K [0x0000000740000000, 0x00000007c0000000, 0x00000007c0000000)
eden space 1900544K, 100% used [0x0000000740000000,0x00000007b4000000,0x00000007b4000000)
from space 67584K, 0% used [0x00000007b4000000,0x00000007b4000000,0x00000007b8200000)
to space 103936K, 0% used [0x00000007b9a80000,0x00000007b9a80000,0x00000007c0000000)
ParOldGen total 4194304K, used 4194183K [0x0000000640000000, 0x0000000740000000, 0x0000000740000000)
object space 4194304K, 99% used [0x0000000640000000,0x000000073ffe1f38,0x0000000740000000)
Metaspace used 59001K, capacity 59492K, committed 61056K, reserved 1101824K
class space used 7300K, capacity 7491K, committed 7808K, reserved 1048576K
Error from the container
16/10/28 14:33:21 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
16/10/28 14:33:26 ERROR Utils: Uncaught exception in thread stdout writer for python
java.lang.OutOfMemoryError: Java heap space
at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:504)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$doExecute$3$$anon$2.hasNext(WholeStageCodegenExec.scala:386)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:120)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:112)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504)
at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:328)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1877)
at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
16/10/28 14:33:36 ERROR Utils: Uncaught exception in thread driver-heartbeater
16/10/28 14:33:26 ERROR Utils: Uncaught exception in thread stdout writer for python
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.Double.valueOf(Double.java:519)
at org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.get(UnsafeArrayData.java:138)
at org.apache.spark.sql.catalyst.util.ArrayData.foreach(ArrayData.scala:135)
at org.apache.spark.sql.execution.python.EvaluatePython$.toJava(EvaluatePython.scala:64)
at org.apache.spark.sql.execution.python.EvaluatePython$.toJava(EvaluatePython.scala:57)
at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2517)
at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2517)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:121)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:112)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504)
at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:328)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1877)
at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
16/10/28 14:33:43 ERROR SparkUncaughtExceptionHandler: [Container in shutdown] Uncaught exception in thread Thread[stdout writer for python,5,main]
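It looks like the data in d1 is very skewed if I partition it by id2. As a result, the join causes the OOM. If d1 were evenly distributed, as I previously thought, the configurations above should work.
I am posting my attempts to solve the problem in case someone runs into a similar issue.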
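My problem is that if I partition d1 by id2, the data is very skewed. As a result, some partitions contain almost all of the id1 values, so the JOIN with d2 causes the OOM error. To mitigate this, I first identify a subset s of id2 values that would cause this skew when partitioning by id2. Then I create d5 from d2 containing only s, and d6 from d2 excluding s. Luckily, d5 is not too big, so I can broadcast-join d1 with d5. Then I join d1 with d6, union the two results, and do the reduceByKey. I got very close to solving the problem this way. I did not pursue it further because my d1 could grow significantly later. In other words, this approach is not really scalable for me.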
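The split described above can be sketched in plain Python, with dict lookups standing in for the Spark joins. This is an illustration of the idea, not the code I ran: `split_skew_join` and `hot_threshold` are hypothetical names, and on real data the hot-key side would be a broadcast join (for example via `pyspark.sql.functions.broadcast`) and the rest a regular shuffle join.

```python
from collections import Counter


def split_skew_join(d1_rows, d2_rows, hot_threshold):
    """d1_rows: (id1, id2) pairs; d2_rows: (id2, value) pairs."""
    # s: id2 values that occur so often in d1 that they would skew the join.
    counts = Counter(id2 for _, id2 in d1_rows)
    s = {id2 for id2, n in counts.items() if n >= hot_threshold}

    d5 = {id2: v for id2, v in d2_rows if id2 in s}      # small, broadcast side
    d6 = {id2: v for id2, v in d2_rows if id2 not in s}  # large, regular side

    # Stand-in for the broadcast join of d1 with d5.
    hot = [(id1, d5[id2]) for id1, id2 in d1_rows if id2 in d5]
    # Stand-in for the regular join of d1 with d6.
    rest = [(id1, d6[id2]) for id1, id2 in d1_rows if id2 in d6]
    return s, hot + rest  # union of the two join results


# Hypothetical toy data: id2 = 1 is the skewed key.
d1_rows = [(1, 1), (1, 3), (2, 1), (3, 1), (2, 7)]
d2_rows = [(1, [0.1]), (3, [0.2]), (7, [0.3])]
s, d3 = split_skew_join(d1_rows, d2_rows, hot_threshold=3)
print(s, d3)
```

The union has the same rows as a single join of d1 with d2, so the reduceByKey afterwards is unchanged.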
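Luckily, in my case, most of the values in d2 are very small. For my application I can safely drop the small values and convert each vector to a SparseVector, which significantly reduces the size of d2. After doing that, I partition d1 by id1 and broadcast-join d2 (after removing the small values). Of course, the driver memory has to be increased to allow for the relatively large broadcast variable. This works for me and for my application.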
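A minimal sketch of the thresholding step, using a plain dict of index to value as a stand-in for a SparseVector. The function names and the cutoff `0.001` are made up for illustration; the right threshold depends on the application.

```python
def to_sparse(values, threshold):
    """Keep only entries whose magnitude is at least `threshold`."""
    return {i: v for i, v in enumerate(values) if abs(v) >= threshold}


def add_sparse(x, y):
    """Element-wise sum of two sparse vectors stored as index -> value dicts."""
    out = dict(x)
    for i, v in y.items():
        out[i] = out.get(i, 0.0) + v
    return out


# Hypothetical row; the real vectors have 5000 entries.
row = [0.1, 0.2, 0.0001, 0.0, 0.7]
sparse = to_sparse(row, threshold=0.001)
merged = add_sparse({0: 0.5}, {0: 0.25, 3: 1.0})
print(sparse, merged)
```

The aggregation then adds sparse vectors instead of dense ones, so both the broadcast copy of d2 and the per-key sums shrink with the number of dropped entries.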
Answer 0 (score: 5)
Here's something to try: reduce your executor size a bit. You've currently got:
--executor-memory 48G
--executor-cores 15
Give this a go:
--executor-memory 16G
--executor-cores 5
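Smaller executors seem to be optimal for a variety of reasons. One of them is that with a Java heap larger than 32G, object references go from 4 bytes to 8, and all memory requirements blow up.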
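To see how the two executor sizes pack onto one node, here is a rough sizing sketch. It assumes the Spark 2.x on YARN default for `spark.yarn.executor.memoryOverhead`, which is the larger of 384 MB and 10% of executor memory, and it assumes the node's full 244G and 64 cores are available to YARN, which is optimistic. The function names are hypothetical.

```python
def container_mb(executor_memory_gb):
    """YARN container size: heap plus the assumed default memory overhead."""
    heap_mb = executor_memory_gb * 1024
    overhead_mb = max(384, int(heap_mb * 0.10))
    return heap_mb + overhead_mb


def executors_per_node(node_memory_gb, node_cores, executor_memory_gb, executor_cores):
    """How many executors fit, limited by whichever resource runs out first."""
    by_memory = (node_memory_gb * 1024) // container_mb(executor_memory_gb)
    by_cores = node_cores // executor_cores
    return min(by_memory, by_cores)


small = executors_per_node(244, 64, executor_memory_gb=16, executor_cores=5)
large = executors_per_node(244, 64, executor_memory_gb=48, executor_cores=15)
print(small, large)
```

Both layouts use roughly the same total memory and cores; the smaller one just spreads them over more JVMs, each with a heap below the 32G mark.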
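Edit: the problem may actually be that the d4 partitions are too large (though the other advice still applies!). You can resolve this by repartitioning d3 into a larger number of partitions (roughly d1 * 4), or by passing that number as the optional numPartitions argument of reduceByKey. Both options trigger a shuffle, but that's better than crashing.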
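One way to pick a partition count is to work backwards from a target partition size. This is only a sketch: the 128 MB target is an arbitrary assumption rather than a Spark default, and 340G is the d4 estimate from the question.

```python
import math


def partitions_for(total_gb, target_partition_mb):
    """Number of partitions needed so each holds about target_partition_mb."""
    return math.ceil(total_gb * 1024 / target_partition_mb)


def partition_mb(total_gb, num_partitions):
    """Average partition size in MB for a given partition count."""
    return total_gb * 1024 / num_partitions


n = partitions_for(340, target_partition_mb=128)
size_at_200 = partition_mb(340, 200)
print(n, size_at_200)

# The count would then be passed along, e.g.:
#   d3.rdd.reduceByKey(lambda x, y: numpy.add(x, y), numPartitions=n)
```

With skewed keys the largest partition can be far above the average, so this gives a lower bound on the count rather than a guarantee.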
Answer 1 (score: 2)
I ran into the same problem, and none of the many answers I found solved it.
Eventually I debugged my code step by step. I found that the problem was caused by unbalanced data sizes across partitions.