Spark: coalesce is very slow even though the output data is very small

Date: 2015-06-25 17:02:44

Tags: scala apache-spark coalesce

I have the following code in Spark:

myData.filter(t => t.getMyEnum() == null)
      .map(t => t.toString)
      .saveAsTextFile("myOutput")

There are 2000+ files in the myOutput folder, but only a few records have t.getMyEnum() == null, so there are very few output records. Since I don't want to search through 2000+ output files for just a handful of records, I tried to combine the output using coalesce, as shown below:

myData.filter(t => t.getMyEnum() == null)
      .map(t => t.toString)
      .coalesce(1, false)
      .saveAsTextFile("myOutput")

But then the job became extremely slow! I'd like to know why it is so slow. Is it just because the few output records are scattered across the 2000+ partitions? Is there a better way to solve this?

1 Answer:

Answer 0 (score: 13)

If you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can pass shuffle = true. This will add a shuffle step, but it means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).

Note: with shuffle = true, you can actually coalesce to a larger number of partitions. This is useful if you have a small number of partitions, say 100, potentially with a few partitions being abnormally large. Calling coalesce(1000, shuffle = true) will result in 1000 partitions with the data distributed using a hash partitioner.

So try passing true to coalesce, i.e.:

myData.filter(_.getMyEnum == null)
      .map(_.toString)
      .coalesce(1, shuffle = true)
      .saveAsTextFile("myOutput")
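
For what it's worth, repartition(numPartitions) in the RDD API is shorthand for coalesce(numPartitions, shuffle = true), so the same fix can also be spelled with repartition. Below is a minimal, self-contained sketch of that variant; the MyRecord class, the fake dataset, and the local master setting are hypothetical stand-ins for the setup in the question:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical stand-in for the records in the question; myEnum is null
// for the few records the filter keeps.
case class MyRecord(id: Int, myEnum: String) {
  def getMyEnum: String = myEnum
}

object RepartitionExample {
  def main(args: Array[String]): Unit = {
    // local[*] is only for trying the sketch on a single machine.
    val sc = new SparkContext(
      new SparkConf().setAppName("repartition-example").setMaster("local[*]"))

    // A tiny fake dataset spread across many partitions, mimicking the
    // "few matching records scattered over 2000+ partitions" situation.
    val myData = sc.parallelize(
      (1 to 10000).map(i => MyRecord(i, if (i % 5000 == 0) null else "SOME_VALUE")),
      numSlices = 2000)

    myData.filter(_.getMyEnum == null)
          .map(_.toString)
          .repartition(1)          // equivalent to coalesce(1, shuffle = true)
          .saveAsTextFile("myOutput")

    sc.stop()
  }
}

Either way, the upstream filter and map still run with full parallelism, and only the tiny filtered result is shuffled down to a single output partition.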