Question

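I am writing a Parquet file from a DataFrame to S3. When I look at the Spark UI, I can see that all tasks but one of the write stage have completed (e.g. 199/200). The last task appears to take forever to complete, and it very often fails because it exceeds the executor memory limit.

I'd like to know what is happening in this last task. How can I optimize it? Thanks.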

Answer 1

I tried Glemmie Helles Sindholt's solution and it works very well. Here is the code:

path = 's3://...'
n = 2  # number of partitions; try 2 to test
spark_df = spark_df.repartition(n)
spark_df.write.mode("overwrite").parquet(path)

Answer 2

It sounds like you have data skew. You can fix this by calling repartition on your DataFrame before writing to S3.
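To illustrate why one task out of 200 can end up with almost all the work, here is a small pure-Python simulation. It is only a sketch: the key names, row counts, and the use of CRC32 as the hash are made up for illustration (Spark uses its own hash function), but the effect is the same. Rows that share a key land in the same partition, while repartition(n) without columns spreads rows round-robin.

```python
import zlib


def hash_partition_sizes(keys, num_partitions):
    """Rows per partition when each row is sent to hash(key) % num_partitions."""
    sizes = [0] * num_partitions
    for key in keys:
        # CRC32 stands in for Spark's real hash function (illustrative assumption).
        sizes[zlib.crc32(key.encode("utf-8")) % num_partitions] += 1
    return sizes


def round_robin_partition_sizes(num_rows, num_partitions):
    """Rows per partition when rows are dealt out one by one, ignoring keys."""
    base, extra = divmod(num_rows, num_partitions)
    return [base + (1 if i < extra else 0) for i in range(num_partitions)]


# Hypothetical skewed data set: one hot key owns 90% of the rows.
keys = ["hot_customer"] * 90_000 + [f"customer_{i}" for i in range(10_000)]

hashed = hash_partition_sizes(keys, 200)
balanced = round_robin_partition_sizes(len(keys), 200)

print("hash partitioning, largest partition:", max(hashed))
print("round robin, largest partition:", max(balanced))
```

The largest hash partition holds at least the 90,000 rows of the hot key, whereas every round-robin partition holds 500 rows. That single oversized partition corresponds to the one task that never seems to finish.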

Answer 3

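This article, The Bleeding Edge: Spark, Parquet and S3, has a lot of useful information about Spark, S3 and Parquet. In particular, it talks about how the driver ends up writing out the _common_metadata_ files, which can take quite a bit of time. There is a way to turn it off.

Unfortunately, they say that they go on to generate the common metadata themselves, but they don't really talk about how they did it.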
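The answer above does not say which setting turns the summary files off. As far as I know, the Parquet property usually used for this is parquet.enable.summary-metadata, which can be passed to Spark with the spark.hadoop. prefix; treat that name as my assumption rather than something the article confirms, and note that newer Spark versions may already disable these files by default. The helper apply_conf below is hypothetical and only exists to keep the sketch self-contained:

```python
# Assumed property name; it is not given in the original answer.
SUMMARY_METADATA_OFF = {
    "spark.hadoop.parquet.enable.summary-metadata": "false",
}


def apply_conf(builder, conf):
    """Apply each key/value pair through builder.config(key, value)."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


# Usage with PySpark (requires pyspark to be installed):
#   from pyspark.sql import SparkSession
#   spark = apply_conf(SparkSession.builder, SUMMARY_METADATA_OFF).getOrCreate()
```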

Answer 4

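As others have noted, data skew is likely at play.

Besides that, I noticed that your task count is 200.

The configuration parameter spark.sql.shuffle.partitions sets the number of partitions that are used when shuffling data for joins or aggregations.

200 is the default for this setting, but it is usually far from optimal.

For small data, 200 can be overkill, and you will waste time on the overhead of many partitions.

For large data, 200 can result in large partitions, which should be broken down into more, smaller partitions.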

The real rules of thumb are:
- Have 2-3x as many partitions as CPUs.
- Or aim for partitions of ~128 MB.
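The two rules of thumb above can be turned into quick arithmetic. This is a rough sketch, not a tuning formula: the function names, the factor of 3, and the example cluster (16 cores, 100 GiB of shuffle data) are assumptions chosen for illustration.

```python
MIB = 1024 * 1024


def partitions_by_cores(total_cores, factor=3):
    """Rule 1: 2-3x as many partitions as CPU cores (factor is a free choice)."""
    return total_cores * factor


def partitions_by_size(total_bytes, target_partition_bytes=128 * MIB):
    """Rule 2: enough partitions that each one holds roughly 128 MB."""
    # Integer ceiling division, with a floor of one partition.
    return max(1, -(-total_bytes // target_partition_bytes))


# Hypothetical job: 100 GiB of shuffle data on a cluster with 16 cores.
data_bytes = 100 * 1024 * MIB
by_cores = partitions_by_cores(16)
by_size = partitions_by_size(data_bytes)

print("by cores:", by_cores)  # 48
print("by size:", by_size)    # 800
```

For this made-up job the two rules disagree widely (48 versus 800), and both are far from the default of 200. With data this large, the size-based number is the one that keeps individual partitions small enough to fit in executor memory.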

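2 GB is the maximum partition size. Also, if you are hovering just below 2000 partitions, note that Spark uses a different data structure for shuffle bookkeeping when the number of partitions is greater than 2000 [1]: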

private[spark] object MapStatus {

  def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): MapStatus = {
    if (uncompressedSizes.length > 2000) {
      HighlyCompressedMapStatus(loc, uncompressedSizes)
    } else {
      new CompressedMapStatus(loc, uncompressedSizes)
    }
  }
...

You can try playing with this parameter at runtime:

spark.conf.set("spark.sql.shuffle.partitions", "300")

[1] What should be the optimal value for spark.sql.shuffle.partitions or how do we increase partitions when using Spark SQL?

Spark write Parquet to S3: the last task takes forever
