Question

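I am new to Spark. I am investigating shuffle behavior in a test application, and I don't understand why the mapPartitionsWithIndex method causes a shuffle in my program! As you can see in the picture, my initial RDD has two 16 MB partitions and the shuffle write is about 49.8 MB. I know that map, mapPartitions, and mapPartitionsWithIndex are not shuffling transformations like groupByKey, yet I see that they also result in a shuffle in Spark. Why?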


Answer 1

I think you are performing a join or group operation after mapPartitionsWithIndex, and that operation is what causes the shuffle.

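You can confirm this by modifying the code. If the shuffle disappears once the join is replaced by a simple action such as count, then the join was responsible, not mapPartitionsWithIndex.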

Current code

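// narrow transformation: each output partition is computed from one input partition
val rdd = inputRDD1.mapPartitionsWithIndex(....)
// wide transformation: records are redistributed by key, which shows up as shuffle write
val outRDD = rdd.join(inputRDD2)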

Modified code

val rdd = inputRDD1.mapPartitionsWithIndex(....)
// count is an action that adds no shuffle of its own
println(rdd.count)
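
Another way to check, without reading the Spark UI, is to inspect the lineage of each RDD. The following is only a minimal sketch: the sample data, the ShuffleCheck object, and the hasShuffle helper are made up for illustration and are not from the original program. It assumes a local Spark installation and that neither input RDD already has a partitioner.

```scala
import org.apache.spark.{ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ShuffleCheck {
  // Hypothetical helper: walks the lineage and reports whether any
  // ancestor is reached through a shuffle dependency.
  def hasShuffle(rdd: RDD[_]): Boolean =
    rdd.dependencies.exists {
      case _: ShuffleDependency[_, _, _] => true
      case dep                           => hasShuffle(dep.rdd)
    }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("shuffle-check").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Made-up key/value data standing in for inputRDD1 and inputRDD2
      val inputRDD1 = sc.parallelize(1 to 1000, 2).map(i => (i % 10, i))
      val inputRDD2 = sc.parallelize(1 to 10, 2).map(i => (i, s"v$i"))

      // Narrow transformation: tags each value with its partition index
      val rdd = inputRDD1.mapPartitionsWithIndex { (idx, it) =>
        it.map { case (k, v) => (k, (idx, v)) }
      }
      // Wide transformation: join has to co-locate equal keys
      val outRDD = rdd.join(inputRDD2)

      println(s"mapPartitionsWithIndex only: ${hasShuffle(rdd)}")
      println(s"after join: ${hasShuffle(outRDD)}")
      // toDebugString prints the lineage; indentation marks shuffle boundaries
      println(outRDD.toDebugString)
    } finally {
      sc.stop()
    }
  }
}
```

Under these assumptions, the first check should report no shuffle and the second should report one, which matches the explanation above: the shuffle write comes from the join, not from mapPartitionsWithIndex.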

Why does mapPartitionsWithIndex cause a shuffle in Spark?
