Question

Edit: The answer helps, but I described my solution in memoryOverhead issue in Spark.

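I have an RDD with 202092 partitions that reads a dataset created by others. I can see by manual inspection that the data is not balanced across the partitions: for example, some of them have 0 images and others have 4k, while the mean is 432. When processing the data, I got this error: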

Container killed by YARN for exceeding memory limits. 16.9 GB of 16 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.

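even though memoryOverhead has already been boosted. I suspect that some spikes are occurring that make YARN kill my container, because a spike overflows the specified limit.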

So what should I do to make sure that my data is (roughly) balanced across partitions?

My idea was that repartition() would work, since it invokes a shuffle:

dataset = dataset.repartition(202092)

but I got exactly the same error, despite what the programming-guide says:

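repartition(numPartitions)

Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.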

Check my toy example:

data = sc.parallelize([0,1,2], 3).mapPartitions(lambda x: range((next(x) + 1) * 1000))
d = data.glom().collect()
len(d[0])     # 1000
len(d[1])     # 2000
len(d[2])     # 3000
repartitioned_data = data.repartition(3)
re_d = repartitioned_data.glom().collect()
len(re_d[0])  # 1854
len(re_d[1])  # 1754
len(re_d[2])  # 2392
repartitioned_data = data.repartition(6)
re_d = repartitioned_data.glom().collect()
len(re_d[0])  # 422
len(re_d[1])  # 845
len(re_d[2])  # 1643
len(re_d[3])  # 1332
len(re_d[4])  # 1547
len(re_d[5])  # 211
repartitioned_data = data.repartition(12)
re_d = repartitioned_data.glom().collect()
len(re_d[0])  # 132
len(re_d[1])  # 265
len(re_d[2])  # 530
len(re_d[3])  # 1060
len(re_d[4])  # 1025
len(re_d[5])  # 145
len(re_d[6])  # 290
len(re_d[7])  # 580
len(re_d[8])  # 1113
len(re_d[9])  # 272
len(re_d[10]) # 522
len(re_d[11]) # 66
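
To compare runs like the ones above at a glance, the per-partition counts can be reduced to a few summary numbers. This is only a minimal sketch: `partition_skew` is a hypothetical helper name, not a Spark API, and the input is simply a list of partition sizes.

```python
from statistics import mean


def partition_skew(sizes):
    """Summarize partition sizes as (min, max, mean, max / mean)."""
    avg = mean(sizes)
    # A ratio near 1.0 means the largest partition is close to the average.
    ratio = max(sizes) / avg if avg else float("inf")
    return min(sizes), max(sizes), avg, ratio


# With PySpark, per-partition counts can be gathered without bringing the
# elements themselves to the driver (unlike glom().collect()):
#   sizes = rdd.mapPartitions(lambda it: [sum(1 for _ in it)]).collect()

# Sizes reported above for repartition(3) in PySpark.
pyspark_sizes = [1854, 1754, 2392]
print(partition_skew(pyspark_sizes))
```

On an RDD with hundreds of thousands of partitions, counting inside mapPartitions keeps the driver from having to hold every element, which glom().collect() would require.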

Answer 1

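I think the problem of exceeding the memory overhead limit is due to DirectMemory buffers used during fetch. I think it was fixed in 2.0.0. (We had the same issue, but stopped digging deeper once we found that upgrading to 2.0.0 resolved it. Unfortunately I don't have Spark issue numbers to back this up.)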

The uneven partitions after repartition are surprising. Contrast with https://github.com/apache/spark/blob/v2.0.0/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L443. Spark even generates random keys in repartition, so it does not rely on a hash that could be biased.
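
The keying scheme in the linked RDD.scala code can be sketched in plain Python to show why it should produce balanced output: each input partition picks a random starting position, then hands its elements to consecutive output partitions. This is a rough simulation that assumes elements are distributed one at a time; `round_robin_repartition` is a hypothetical name, and the sketch runs without Spark.

```python
import random


def round_robin_repartition(partitions, num_partitions):
    """Distribute elements round-robin, starting at a random offset per input partition."""
    out = [[] for _ in range(num_partitions)]
    for index, items in enumerate(partitions):
        # Seeding with the partition index mirrors the deterministic
        # per-partition start position in the linked Scala code.
        position = random.Random(index).randrange(num_partitions)
        for item in items:
            position += 1
            # The key is the position itself; taking it modulo the
            # partition count plays the role of the hash partitioner.
            out[position % num_partitions].append(item)
    return out


# Same shape as the toy example: input partitions of 1000, 2000 and 3000 elements.
toy = [list(range(n)) for n in (1000, 2000, 3000)]
for n in (3, 6, 12):
    print(n, [len(p) for p in round_robin_repartition(toy, n)])
```

Under this scheme each input partition contributes sizes that differ by at most one element across the outputs, so with three input partitions the output sizes can differ by at most three.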

I tried your example and got exactly the same results with Spark 1.6.2 and Spark 2.0.0. But not from the Scala spark-shell:

scala> val data = sc.parallelize(1 to 3, 3).mapPartitions { it => (1 to it.next * 1000).iterator }
data: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[6] at mapPartitions at <console>:24

scala> data.mapPartitions { it => Iterator(it.toSeq.size) }.collect.toSeq
res1: Seq[Int] = WrappedArray(1000, 2000, 3000)

scala> data.repartition(3).mapPartitions { it => Iterator(it.toSeq.size) }.collect.toSeq
res2: Seq[Int] = WrappedArray(1999, 2001, 2000)

scala> data.repartition(6).mapPartitions { it => Iterator(it.toSeq.size) }.collect.toSeq
res3: Seq[Int] = WrappedArray(999, 1000, 1000, 1000, 1001, 1000)

scala> data.repartition(12).mapPartitions { it => Iterator(it.toSeq.size) }.collect.toSeq
res4: Seq[Int] = WrappedArray(500, 501, 501, 501, 501, 500, 499, 499, 499, 499, 500, 500)

Such beautiful partitions!

(Sorry this is not a full answer. I just wanted to share my findings so far.)
