Problem writing to a file in Spark

Time: 2016-03-21 05:42:39

Tags: scala apache-spark apache-spark-mllib

I am running Spark in local mode with the following options:

spark-shell --driver-memory 21G --executor-memory 10G --num-executors 4 --driver-java-options "-Dspark.executor.memory=10G"  --executor-cores 8

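This is a four-node cluster with 32G of RAM per node.

I use DIMSUM to compute column similarities and then try to write the result to a file.

It computes the column similarities for 6.7 million items, and persisting the result to a file causes the threads to spill to disk over and over.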

dimSumOutput.coalesce(1, true).saveAsTextFile("/user/similarity")

dimSumOutput is an RDD that contains the column similarities in the format (row, col, sim).
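The question does not show how dimSumOutput was built. As a rough sketch only, an RDD of that shape can be obtained from MLlib's RowMatrix.columnSimilarities, which implements DIMSUM when given a positive threshold. The object name, the helper name, and the `rows` and `threshold` parameters below are hypothetical and not taken from the question.

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

object DimsumSketch {
  // Hypothetical helper: `rows` holds one MLlib vector per matrix row.
  // A threshold of 0.0 computes exact similarities by brute force; a
  // positive threshold lets DIMSUM sample, trading accuracy for less work.
  def columnSimilarities(rows: RDD[Vector], threshold: Double): RDD[(Long, Long, Double)] = {
    val matrix = new RowMatrix(rows)
    // columnSimilarities returns a CoordinateMatrix whose entries are the
    // upper-triangular (i < j) similarities as MatrixEntry(i, j, value).
    matrix.columnSimilarities(threshold).entries.map(e => (e.i, e.j, e.value))
  }
}
```

Each element of the returned RDD is a (row, col, sim) triple, which matches the format described above.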

16/03/20 21:41:22 INFO spark.ContextCleaner: Cleaned shuffle 2
16/03/20 21:41:25 INFO collection.ExternalSorter: Thread 184 spilling in-memory map of 479.5 MB to disk (1 time so far)
16/03/20 21:41:26 INFO collection.ExternalSorter: Thread 185 spilling in-memory map of 479.5 MB to disk (1 time so far)
16/03/20 21:41:26 INFO collection.ExternalSorter: Thread 189 spilling in-memory map of 479.5 MB to disk (1 time so far)
16/03/20 21:41:28 INFO collection.ExternalSorter: Thread 188 spilling in-memory map of 478.4 MB to disk (1 time so far)
16/03/20 21:41:31 INFO collection.ExternalSorter: Thread 186 spilling in-memory map of 535.0 MB to disk (1 time so far)
16/03/20 21:41:32 INFO collection.ExternalSorter: Thread 187 spilling in-memory map of 609.3 MB to disk (1 time so far)
16/03/20 21:42:07 INFO collection.ExternalSorter: Thread 184 spilling in-memory map of 481.3 MB to disk (2 times so far)
16/03/20 21:42:14 INFO collection.ExternalSorter: Thread 185 spilling in-memory map of 479.5 MB to disk (2 times so far)
16/03/20 21:42:18 INFO collection.ExternalSorter: Thread 188 spilling in-memory map of 478.4 MB to disk (2 times so far)
16/03/20 21:42:21 INFO collection.ExternalSorter: Thread 189 spilling in-memory map of 491.5 MB to disk (2 times so far)
16/03/20 21:42:27 INFO collection.ExternalSorter: Thread 186 spilling in-memory map of 542.7 MB to disk (2 times so far)
16/03/20 21:42:32 INFO collection.ExternalSorter: Thread 187 spilling in-memory map of 583.7 MB to disk (2 times so far)
16/03/20 21:43:25 INFO collection.ExternalSorter: Thread 184 spilling in-memory map of 479.5 MB to disk (3 times so far)
16/03/20 21:43:33 INFO collection.ExternalSorter: Thread 185 spilling in-memory map of 479.5 MB to disk (3 times so far)
16/03/20 21:43:45 INFO collection.ExternalSorter: Thread 189 spilling in-memory map of 483.8 MB to disk (3 times so far)
16/03/20 21:43:50 INFO collection.ExternalSorter: Thread 188 spilling in-memory map of 478.4 MB to disk (3 times so far)
16/03/20 21:43:56 INFO collection.ExternalSorter: Thread 186 spilling in-memory map of 535.0 MB to disk (3 times so far)
16/03/20 21:44:01 INFO collection.ExternalSorter: Thread 187 spilling in-memory map of 624.6 MB to disk (3 times so far)
16/03/20 21:44:14 INFO collection.ExternalSorter: Thread 184 spilling in-memory map of 482.6 MB to disk (4 times so far)
16/03/20 21:44:20 INFO collection.ExternalSorter: Thread 185 spilling in-memory map of 479.5 MB to disk (4 times so far)
16/03/20 21:44:37 INFO collection.ExternalSorter: Thread 189 spilling in-memory map of 479.5 MB to disk (4 times so far)
16/03/20 21:45:09 INFO collection.ExternalSorter: Thread 188 spilling in-memory map of 478.4 MB to disk (4 times so far)
16/03/20 21:45:22 INFO collection.ExternalSorter: Thread 186 spilling in-memory map of 581.1 MB to disk (4 times so far)
16/03/20 21:45:23 INFO collection.ExternalSorter: Thread 187 spilling in-memory map of 539.5 MB to disk (4 times so far)
16/03/20 21:45:28 INFO collection.ExternalSorter: Thread 184 spilling in-memory map of 479.5 MB to disk (5 times so far)
16/03/20 21:45:40 INFO collection.ExternalSorter: Thread 185 spilling in-memory map of 486.4 MB to disk (5 times so far)
16/03/20 21:45:52 INFO collection.ExternalSorter: Thread 188 spilling in-memory map of 478.4 MB to disk (5 times so far)
16/03/20 21:45:59 INFO collection.ExternalSorter: Thread 189 spilling in-memory map of 479.5 MB to disk (5 times so far)
16/03/20 21:46:14 INFO collection.ExternalSorter: Thread 184 spilling in-memory map of 479.5 MB to disk (6 times so far)
16/03/20 21:46:24 INFO collection.ExternalSorter: Thread 187 spilling in-memory map of 539.6 MB to disk (5 times so far)
16/03/20 21:46:25 INFO collection.ExternalSorter: Thread 186 spilling in-memory map of 527.4 MB to disk (5 times so far)
16/03/20 21:47:11 INFO collection.ExternalSorter: Thread 185 spilling in-memory map of 576.0 MB to disk (6 times so far)
16/03/20 21:47:19 INFO collection.ExternalSorter: Thread 189 spilling in-memory map of 491.5 MB to disk (6 times so far)
16/03/20 21:47:20 INFO collection.ExternalSorter: Thread 188 spilling in-memory map of 478.4 MB to disk (6 times so far)
16/03/20 21:47:43 INFO collection.ExternalSorter: Thread 184 spilling in-memory map of 686.1 MB to disk (7 times so far)
16/03/20 21:47:50 INFO collection.ExternalSorter: Thread 187 spilling in-memory map of 539.5 MB to disk (6 times so far)
16/03/20 21:47:57 INFO collection.ExternalSorter: Thread 186 spilling in-memory map of 599.0 MB to disk (6 times so far)
16/03/20 21:48:04 INFO collection.ExternalSorter: Thread 185 spilling in-memory map of 481.3 MB to disk (7 times so far)
16/03/20 21:48:39 INFO collection.ExternalSorter: Thread 189 spilling in-memory map of 479.5 MB to disk (7 times so far)
16/03/20 21:48:40 INFO collection.ExternalSorter: Thread 188 spilling in-memory map of 478.4 MB to disk (7 times so far)
16/03/20 21:49:06 INFO collection.ExternalSorter: Thread 184 spilling in-memory map of 479.5 MB to disk (8 times so far)
16/03/20 21:49:21 INFO collection.ExternalSorter: Thread 186 spilling in-memory map of 519.5 MB to disk (7 times so far)
16/03/20 21:49:21 INFO collection.ExternalSorter: Thread 185 spilling in-memory map of 489.0 MB to disk (8 times so far)
16/03/20 21:49:28 INFO collection.ExternalSorter: Thread 187 spilling in-memory map of 540.2 MB to disk (7 times so far)
16/03/20 21:49:36 INFO collection.ExternalSorter: Thread 188 spilling in-memory map of 485.1 MB to disk (8 times so far)
16/03/20 21:49:39 INFO collection.ExternalSorter: Thread 189 spilling in-memory map of 601.6 MB to disk (8 times so far)
16/03/20 21:50:04 INFO collection.ExternalSorter: Thread 184 spilling in-memory map of 576.0 MB to disk (9 times so far)
16/03/20 21:50:20 INFO collection.ExternalSorter: Thread 186 spilling in-memory map of 519.7 MB to disk (8 times so far)
16/03/20 21:50:24 INFO collection.ExternalSorter: Thread 185 spilling in-memory map of 479.5 MB to disk (9 times so far)
16/03/20 21:50:27 INFO collection.ExternalSorter: Thread 187 spilling in-memory map of 539.5 MB to disk (8 times so far)
16/03/20 21:50:28 INFO collection.ExternalSorter: Thread 188 spilling in-memory map of 478.4 MB to disk (9 times so far)
16/03/20 21:51:03 INFO collection.ExternalSorter: Thread 189 spilling in-memory map of 489.0 MB to disk (9 times so far)
16/03/20 21:51:22 INFO collection.ExternalSorter: Thread 184 spilling in-memory map of 479.5 MB to disk (10 times so far)
16/03/20 21:51:41 INFO collection.ExternalSorter: Thread 186 spilling in-memory map of 519.5 MB to disk (9 times so far)
16/03/20 21:51:45 INFO collection.ExternalSorter: Thread 188 spilling in-memory map of 483.8 MB to disk (10 times so far)
16/03/20 21:51:45 INFO collection.ExternalSorter: Thread 185 spilling in-memory map of 479.5 MB to disk (10 times so far)
16/03/20 21:51:51 INFO collection.ExternalSorter: Thread 187 spilling in-memory map of 550.4 MB to disk (9 times so far)
16/03/20 21:52:04 INFO collection.ExternalSorter: Thread 189 spilling in-memory map of 479.5 MB to disk (10 times so far)
16/03/20 21:52:20 INFO collection.ExternalSorter: Thread 184 spilling in-memory map of 509.4 MB to disk (11 times so far)
16/03/20 21:52:40 INFO collection.ExternalSorter: Thread 185 spilling in-memory map of 479.5 MB to disk (11 times so far)

Any pointers on how to fix this?

1 answer:

Answer 0 (score: 1)

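1) You pass --executor-memory 65G (more than the 32GB each node has!) and then, on the same command line, --driver-java-options "-Dspark.executor.memory=10G", which is strange. Is this a typo? If not, are you sure what effect such a call has? Please provide more information.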

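2) More importantly, after your 4 workers have processed the data, you ask Spark to coalesce it into a single partition (and therefore onto a single executor). Depending on how much memory the executor was actually given (see 1), this can mean that one executor has to handle far too many records. I would first verify how much memory is really allocated to the executors (for example in the Spark UI, or in the YARN UI if you use it). Then I would seriously reconsider whether you need to coalesce to 1 at all. Also, as @Yaron suggests, you can look at your application's shuffle-related settings and change spark.shuffle.memoryFraction (keep in mind that its sum with spark.storage.memoryFraction should be at most 0.8, and that newer versions of Spark treat these settings as deprecated).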
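The two suggestions in point 2 could be tried along these lines. This is only a sketch under stated assumptions: the object and method names are hypothetical, and the comma-separated output format is an assumption rather than something from the question. Note that getExecutorMemoryStatus reports the memory available for caching per block manager, not the full executor heap, so treat it as a sanity check alongside the Spark UI.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object SaveSketch {
  // Hypothetical helper: lists the configured executor memory followed by
  // the caching memory (max and remaining) that each block manager reports.
  def describeExecutorMemory(sc: SparkContext): Seq[String] = {
    val configured = sc.getConf.getOption("spark.executor.memory").getOrElse("<not set>")
    val perBlockManager = sc.getExecutorMemoryStatus.toSeq.map {
      case (host, (maxBytes, remainingBytes)) =>
        s"$host: max=${maxBytes >> 20} MB, remaining=${remainingBytes >> 20} MB"
    }
    s"spark.executor.memory=$configured" +: perBlockManager
  }

  // Hypothetical helper: writes one part file per existing partition, so no
  // shuffle into a single partition is needed before saving.
  def saveWithoutMerging(similarities: RDD[(Long, Long, Double)], path: String): Unit =
    similarities
      .map { case (row, col, sim) => s"$row,$col,$sim" } // assumed output format
      .saveAsTextFile(path)
}
```

In spark-shell this might be used as SaveSketch.describeExecutorMemory(sc).foreach(println) followed by SaveSketch.saveWithoutMerging(dimSumOutput, "/user/similarity"). If a single file is still required, the resulting part files can be concatenated afterwards outside Spark, for example with hadoop fs -getmerge.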
