How do I combine three RDDs?

Time: 2015-12-04 15:22:20

Tags: apache-spark rdd

I have three RDDs that share the same kind of partition key (say, a String idOwner):

JavaPairRDD<PartitionKey, Iterable<Cat>> rddCat
JavaPairRDD<PartitionKey, Iterable<Dog>> rddDog
JavaPairRDD<PartitionKey, Iterable<Fish>> rddFish

How do I get from these to the expected result:

JavaPairRDD<PartitionKey, Tuple3<Iterable<Cat>, Iterable<Dog>, Iterable<Fish>>>

So far I have only managed the following.

Failed attempt 1:

rddCat.cogroup(rddDog, rddFish)
--> FlatMapFunction<Tuple2<PartitionKey, Tuple3<Iterable<Iterable<Cat>>, Iterable<Iterable<Dog>>, Iterable<Iterable<Fish>>>>

Failed attempt 2:

JavaPairRDD<PartitionKey, Tuple2<Iterable<Cat>, Iterable<Dog>>> catDogRdd = rddCat.join(rddDog);
JavaPairRDD<PartitionKey, Tuple2<Tuple2<Iterable<Cat>, Iterable<Dog>>, Iterable<Fish>>> finalRdd = catDogRdd.join(rddFish);

2 answers:

Answer 0 (score: 0)

tl;dr Use join, i.e. def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))].

I use Scala, and the following seems to work fine.

scala> r2.collect
res7: Array[(Int, Iterable[Int])] = Array((0,CompactBuffer(0, 1)), (3,CompactBuffer(6, 7)), (4,CompactBuffer(8, 9)), (1,CompactBuffer(2, 3)), (2,CompactBuffer(4, 5)))

scala> r3.collect
res8: Array[(Int, Iterable[Int])] = Array((0,CompactBuffer(0, 1, 2)), (3,CompactBuffer(9)), (1,CompactBuffer(3, 4, 5)), (2,CompactBuffer(6, 7, 8)))

scala> r5.collect
res9: Array[(Int, Iterable[Int])] = Array((0,CompactBuffer(0, 1, 2, 3, 4)), (1,CompactBuffer(5, 6, 7, 8, 9)))

scala> r2 join r3 join r5 collect
res10: Array[(Int, ((Iterable[Int], Iterable[Int]), Iterable[Int]))] = Array((0,((CompactBuffer(0, 1),CompactBuffer(0, 1, 2)),CompactBuffer(0, 1, 2, 3, 4))), (1,((CompactBuffer(2, 3),CompactBuffer(3, 4, 5)),CompactBuffer(5, 6, 7, 8, 9))))

See org.apache.spark.rdd.PairRDDFunctions for details.
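One caveat about the transcript above: join is an inner join, so only keys present in all three RDDs survive, which is why res10 contains keys 0 and 1 but not 2, 3 or 4. cogroup, by contrast, keeps every key and leaves an empty group for a missing side. The following is a minimal plain-Java sketch of that difference in key retention, not Spark code. The class JoinVsCogroupSketch and its methods are hypothetical, the maps merely stand in for r2, r3 and r5, and the extra Iterable nesting that Spark's cogroup adds is not modelled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration class: models key retention only, not Spark itself.
public class JoinVsCogroupSketch {

    /** Inner-join model: a key survives only if every input contains it. */
    static Map<Integer, List<List<Integer>>> innerJoin(List<Map<Integer, List<Integer>>> inputs) {
        Set<Integer> keys = new TreeSet<>(inputs.get(0).keySet());
        for (Map<Integer, List<Integer>> input : inputs) {
            keys.retainAll(input.keySet());
        }
        return collect(keys, inputs);
    }

    /** Cogroup model: every key from any input survives; a missing side becomes an empty list. */
    static Map<Integer, List<List<Integer>>> cogroup(List<Map<Integer, List<Integer>>> inputs) {
        Set<Integer> keys = new TreeSet<>();
        for (Map<Integer, List<Integer>> input : inputs) {
            keys.addAll(input.keySet());
        }
        return collect(keys, inputs);
    }

    /** Builds one row per key, with one slot per input in input order. */
    private static Map<Integer, List<List<Integer>>> collect(
            Set<Integer> keys, List<Map<Integer, List<Integer>>> inputs) {
        Map<Integer, List<List<Integer>>> out = new TreeMap<>();
        for (Integer key : keys) {
            List<List<Integer>> row = new ArrayList<>();
            for (Map<Integer, List<Integer>> input : inputs) {
                row.add(input.getOrDefault(key, Collections.<Integer>emptyList()));
            }
            out.put(key, row);
        }
        return out;
    }

    /** Sample data copied from the r2, r3 and r5 outputs in the transcript. */
    static List<Map<Integer, List<Integer>>> sample() {
        Map<Integer, List<Integer>> r2 = new TreeMap<>();
        r2.put(0, Arrays.asList(0, 1));
        r2.put(1, Arrays.asList(2, 3));
        r2.put(2, Arrays.asList(4, 5));
        r2.put(3, Arrays.asList(6, 7));
        r2.put(4, Arrays.asList(8, 9));

        Map<Integer, List<Integer>> r3 = new TreeMap<>();
        r3.put(0, Arrays.asList(0, 1, 2));
        r3.put(1, Arrays.asList(3, 4, 5));
        r3.put(2, Arrays.asList(6, 7, 8));
        r3.put(3, Arrays.asList(9));

        Map<Integer, List<Integer>> r5 = new TreeMap<>();
        r5.put(0, Arrays.asList(0, 1, 2, 3, 4));
        r5.put(1, Arrays.asList(5, 6, 7, 8, 9));

        return Arrays.asList(r2, r3, r5);
    }

    public static void main(String[] args) {
        System.out.println(innerJoin(sample()).keySet()); // prints [0, 1]
        System.out.println(cogroup(sample()).keySet());   // prints [0, 1, 2, 3, 4]
    }
}
```

If every owner must appear in the result even without a cat, dog or fish, the cogroup route is probably the safer choice; join is fine when only owners with all three matter.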

Answer 1 (score: 0)

I managed to do it with the help of Guava:

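    import com.google.common.collect.Iterables;
    import org.apache.spark.api.java.JavaPairRDD;
    import scala.Tuple3;

    // given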
    final JavaPairRDD<Character, Iterable<Integer>> rdd1 = ...
    final JavaPairRDD<Character, Iterable<Integer>> rdd2 = ...
    final JavaPairRDD<Character, Iterable<Integer>> rdd3 = ...

    // when
    final JavaPairRDD<Character, Tuple3<Iterable<Iterable<Integer>>, Iterable<Iterable<Integer>>, Iterable<Iterable<Integer>>>> grouped = rdd1.cogroup(rdd2, rdd3);
    final JavaPairRDD<Character, Tuple3<Iterable<Integer>, Iterable<Integer>, Iterable<Integer>>> flattened = grouped.mapValues(
            t3 -> new Tuple3<>(Iterables.concat(t3._1()), Iterables.concat(t3._2()), Iterables.concat(t3._3()))
    );
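The flattening step does not strictly need Guava. Below is a minimal plain-Java sketch of what Iterables.concat does to each tuple component, with one difference: Guava returns a lazy view, while this version eagerly copies into an ArrayList (which is Serializable, and that may matter if the values are later shuffled or cached). The class FlattenSketch and the method flatten are hypothetical names introduced for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: an eager, Guava-free stand-in for Iterables.concat.
public class FlattenSketch {

    /** Concatenates the inner iterables into one list, preserving encounter order. */
    static <T> List<T> flatten(Iterable<? extends Iterable<? extends T>> nested) {
        List<T> flat = new ArrayList<>();
        for (Iterable<? extends T> inner : nested) {
            for (T item : inner) {
                flat.add(item);
            }
        }
        return flat;
    }

    public static void main(String[] args) {
        List<List<Integer>> nested = Arrays.asList(
                Arrays.asList(1, 2),
                Arrays.asList(3));
        System.out.println(flatten(nested)); // prints [1, 2, 3]
    }
}
```

Assuming the helper is on the classpath of the Spark job, FlattenSketch.flatten(t3._1()) could stand in for Iterables.concat(t3._1()) inside the mapValues lambda, since a List is also an Iterable.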

@Fundhor, I wonder how you managed to produce that signature in your first attempt. It seems impossible.