Question

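I have a DataFrame with roughly 200-600 GB of data that I read, manipulate, and then write to CSV using the Spark shell (Scala) on an Elastic MapReduce cluster. The write to CSV fails even after 8 hours.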

This is how I write to CSV:

result.persist.coalesce(20000).write.option("delimiter",",").csv("s3://bucket-name/results")

The result variable is created by combining columns from some other DataFrames:

var result = sources.join(destinations, Seq("source_d", "destination_d")).select("source_i", "destination_i")
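
To make the shape of that join concrete, here is a minimal sketch on toy data that can be pasted into spark-shell. The rows, and the guess that source_i lives in sources while destination_i lives in destinations, are assumptions for illustration only; the real schemas are not shown in this post.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell this returns the existing session; standalone it starts a local one.
val spark = SparkSession.builder().appName("join-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// ASSUMPTION: hypothetical toy rows and column placement, not the real data.
val sources = Seq(("a", "x", 1), ("b", "y", 2)).toDF("source_d", "destination_d", "source_i")
val destinations = Seq(("a", "x", 10), ("c", "z", 30)).toDF("source_d", "destination_d", "destination_i")

// Same expression as above: inner join on both key columns, then keep two columns.
val result = sources.join(destinations, Seq("source_d", "destination_d")).select("source_i", "destination_i")

result.show()
```

Only the key pair present in both toy frames survives the inner join, so the output has a single row.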

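I can read the CSV data it is based on in about 22 minutes. In the same program, I can also write another (smaller) DataFrame to CSV in 8 minutes. For this result DataFrame, however, it takes more than 8 hours and still fails, saying that one of the connections was closed.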

I am running this job on 13 x c4.8xlarge instances on EC2, each with 36 cores and 60 GB of RAM, so I would think I have enough capacity to write to CSV, especially given 8 hours.
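
As a rough back-of-the-envelope check of that capacity, here is a minimal sketch in plain Scala. It assumes the result is about as large as the input (200-600 GB), which may not hold after a join, and that every core runs one task at a time. All names are illustrative.

```scala
// Back-of-the-envelope capacity check; all names here are illustrative.
val instances = 13          // c4.8xlarge nodes, from the post
val coresPerInstance = 36
val ramPerInstanceGb = 60

val totalCores = instances * coresPerInstance   // 468
val totalRamGb = instances * ramPerInstanceGb   // 780

// ASSUMPTION: the result is roughly the size of the input (200-600 GB).
val dataSizesGb = Seq(200, 600)
val partitionCounts = Seq(20000, 66000)         // the two partition counts tried in this post

for (sizeGb <- dataSizesGb; parts <- partitionCounts) {
  val mbPerPartition = sizeGb * 1024.0 / parts
  // ASSUMPTION: one task per core, all cores available to executors.
  val taskWaves = math.ceil(parts.toDouble / totalCores).toInt
  println(f"$sizeGb%d GB over $parts%d partitions: $mbPerPartition%.1f MB each, about $taskWaves%d task waves on $totalCores%d cores")
}
```

Under these assumptions each partition is small (tens of megabytes at most), so the per-task data volume alone would not obviously explain an 8-hour run.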

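Many stages need retries or have failed tasks, and I cannot figure out what I am doing wrong or why it takes so long. I can see from the Spark UI that it never even reaches the CSV write stage and is busy with the persist stages, but without the persist call it still fails after 8 hours. Any ideas? Help is greatly appreciated!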

Update

I ran the following commands to repartition the result variable into 66K partitions:

val r2 = result.repartition(66000) // confirmed with numpartitions
r2.write.option("delimiter",",").csv("s3://s3-bucket/results")

However, the job still fails even after several hours. What exactly am I doing wrong?
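
For reference, the kind of partition-count check mentioned in the code comment can be sketched as follows. This is a minimal, self-contained version on a tiny stand-in DataFrame (the name small and its rows are hypothetical); on the real data the same call would be made on r2.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell this returns the existing session; standalone it starts a local one.
val spark = SparkSession.builder().appName("partition-check").master("local[*]").getOrCreate()
import spark.implicits._

// ASSUMPTION: tiny stand-in for `result`; the real DataFrame comes from the join.
val small = Seq((1, 2), (3, 4), (5, 6)).toDF("source_i", "destination_i")

// repartition(n) shuffles the rows into exactly n partitions.
val repartitioned = small.repartition(8)

// A DataFrame exposes its partition count through the underlying RDD.
println(repartitioned.rdd.getNumPartitions)
```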

Note that I am running the Spark shell via spark-shell yarn --driver-memory 50G

Update 2:

I tried calling persist first, before doing the write:

import org.apache.spark.storage.StorageLevel
r2.persist(StorageLevel.MEMORY_AND_DISK)

But many stages failed, either with "Job aborted due to stage failure: ShuffleMapStage 10 (persist at <console>:36) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 3" or with "Connection from ip-172-31-48-180.ec2.internal/172.31.48.180:7337 closed".

Executors page