Question

I have three RDDs like this:

rdd1:

rdd2:

11 12
13 14

rdd3:

15 16
17 18
19 20

and I want to do this:

rdd1.zip(rdd2.union(rdd3))

and I want the result to look like this:

but I get an exception like this:

Exception in thread "main" java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions

Someone told me I can avoid the exception by doing this:

rdd1.zip(rdd2.union(rdd3).repartition(1))

But that seems to come at a cost, so I would like to know whether there are other ways to solve this problem.

Answer 1

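I'm not sure what you mean by "cost", but you are right to suspect that repartition(1) is not the right solution. It repartitions the RDD into a single partition.

If your data does not fit on a single machine, this will fail.
It only works if rdd1 also has a single partition. Once you have more data, that will probably no longer hold.
repartition performs a shuffle, so your data can end up in a different order.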
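To make the partition mismatch concrete, here is a minimal sketch that only inspects partition counts. The sample values, the partition counts passed to parallelize, and the names PartitionCountSketch and partitionCounts are assumptions made for illustration; the question does not show the contents of rdd1. It assumes a Spark version that provides RDD.getNumPartitions and runs in local mode.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionCountSketch {
  // Returns (partitions of rdd1, partitions of the union, partitions after repartition(1)).
  def partitionCounts(sc: SparkContext): (Int, Int, Int) = {
    // Hypothetical sample data; every RDD is explicitly split into 2 partitions.
    val rdd1 = sc.parallelize(Seq("a", "b", "c", "d", "e"), 2)
    val rdd2 = sc.parallelize(Seq((11, 12), (13, 14)), 2)
    val rdd3 = sc.parallelize(Seq((15, 16), (17, 18), (19, 20)), 2)

    // These inputs have no partitioner, so the union keeps the partitions
    // of rdd2 followed by the partitions of rdd3.
    val unioned = rdd2.union(rdd3)

    (rdd1.getNumPartitions,
     unioned.getNumPartitions,
     unioned.repartition(1).getNumPartitions)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("partition-count-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(partitionCounts(sc)) finally sc.stop()
  }
}
```

With these inputs the three counts should be 2, 4 and 1, so neither the plain union nor the repartitioned one lines up with rdd1 unless rdd1 itself happens to have a matching layout.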

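I think the right solution is to give up on zip, because you probably cannot guarantee that the partitioning will match. Create a key and use join instead: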

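// Key every element by its position so the RDDs can be joined on it.
val indexedRDD1 = rdd1.zipWithIndex.map { case (v, i) => i -> v }
val indexedRDD2 = rdd2.zipWithIndex.map { case (v, i) => i -> v }
// rdd3 continues where rdd2 ends, so shift its indices by rdd2's size.
val offset = rdd2.count
val indexedRDD3 = rdd3.zipWithIndex.map { case (v, i) => (i + offset) -> v }
// Each index of rdd1 matches an element of rdd2 or of rdd3; take whichever is present.
val combined =
  indexedRDD1.leftOuterJoin(indexedRDD2).leftOuterJoin(indexedRDD3).map {
    case (i, ((v1, v2Opt), v3Opt)) => i -> (v1, v2Opt.getOrElse(v3Opt.get))
  }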

This will work no matter how the RDDs are partitioned. If you like, you can sort the result and drop the index at the end:

val unindexed = combined.sortByKey().values
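For reference, the snippets above can be assembled into a small local-mode program. This is only a sketch: the sample data (in particular the string values in rdd1, which the question does not show), the partition counts, and the names ZipViaJoinSketch and zipViaJoin are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ZipViaJoinSketch {
  // Pairs each element of rdd1 with the element at the same position in rdd2 ++ rdd3.
  def zipViaJoin(sc: SparkContext): Array[(String, (Int, Int))] = {
    // Hypothetical sample data with deliberately different partition counts.
    val rdd1 = sc.parallelize(Seq("a", "b", "c", "d", "e"), 3)
    val rdd2 = sc.parallelize(Seq((11, 12), (13, 14)), 2)
    val rdd3 = sc.parallelize(Seq((15, 16), (17, 18), (19, 20)), 2)

    // Key every element by its position.
    val indexedRDD1 = rdd1.zipWithIndex.map { case (v, i) => i -> v }
    val indexedRDD2 = rdd2.zipWithIndex.map { case (v, i) => i -> v }
    // Shift rdd3's indices so they continue after rdd2's.
    val offset = rdd2.count
    val indexedRDD3 = rdd3.zipWithIndex.map { case (v, i) => (i + offset) -> v }

    val combined =
      indexedRDD1.leftOuterJoin(indexedRDD2).leftOuterJoin(indexedRDD3).map {
        case (i, ((v1, v2Opt), v3Opt)) => i -> (v1, v2Opt.getOrElse(v3Opt.get))
      }

    // Restore the original order and drop the index.
    combined.sortByKey().values.collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("zip-via-join-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try zipViaJoin(sc).foreach(println) finally sc.stop()
  }
}
```

With this sample input, "a" and "b" should be paired with the two rdd2 elements and "c" through "e" with the three rdd3 elements. Note that if rdd1 has more elements than rdd2 and rdd3 combined, v3Opt.get is called on None for the extra indices and throws NoSuchElementException, so the element counts still have to line up even though the partitioning does not.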
