Question

// I am using Spark 2.01 //

My data looks like this:

(K1,Array(V1,V2,V3.....V30))
(K2,Array(V1,V2,V3.....V30))
(K3,Array(V1,V2,V3.....V30))
...
(K3704, Array(V1,V2,V3.....V30))

For each key, I want to create a list containing the Cartesian product of its values, as pairs:

(K1, (V1,V2),(V1,V3),(V1,V4) ...
(K2, (V2,V3),(V2,V4),(V2,V5) ...
...
// PS: there should be no duplicate elements; (V1,V2) and (V2,V1) count as the same pair.

I think there will be 30! operations for each key, but it would be better if this could be optimized.

Answer 1

In Python, we can use the combinations() function from the itertools package inside mapValues():

from itertools import combinations
rdd.mapValues(lambda x: list(combinations(x, 2)))
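
To see what that lambda produces for a single key, here is a minimal sketch that runs without Spark. The helper name `pairs` and the sample values are made up for illustration; only `itertools.combinations` comes from the answer itself.

```python
from itertools import combinations


def pairs(values):
    """Return every unordered pair of values, like the mapValues lambda above."""
    return list(combinations(values, 2))


# Hypothetical values for one key; the real data has 30 values per key.
sample_values = ["V1", "V2", "V3", "V4"]
sample_pairs = pairs(sample_values)
print(sample_pairs)

# Count the pairs generated for a key that holds 30 values.
thirty_values = ["V%d" % i for i in range(1, 31)]
pair_count = len(pairs(thirty_values))
print(pair_count)
```

Because combinations() emits each unordered pair once, a key with n values yields n * (n - 1) / 2 pairs, which is 435 for n = 30 rather than 30!.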

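In Scala, we can use the combinations() method in a similar way. However, because it only takes in and returns objects of type Seq, we have to chain a few more methods together to get the format you expect: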

rdd.mapValues(_.toSeq.combinations(2).toArray.map{case Seq(x,y) => (x,y)})