pySpark COGROUP operation on DFs based on keys

Time: 2017-07-13 13:39:33

Tags: pyspark

I want to perform a coGroup operation on two relations A and B, using the keys A_key and B_key respectively.

I tried to do this by performing a groupBy on each relation and then joining them, but as I found out, in the case of PySpark DFs you cannot perform a join on grouped data.
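For reference, here is a minimal sketch of that attempted route (an assumption, not code from the question; A, B and their val columns are hypothetical, and a Spark 2.x-style SparkSession is assumed). Note that groupBy alone returns a GroupedData object, which has no join method; an aggregation such as collect_list is needed to turn it back into a joinable DataFrame:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical relations; the A_key/B_key and val columns are assumed.
A = spark.createDataFrame([("a", 1), ("b", 4)], ["A_key", "val"])
B = spark.createDataFrame([("a", 2)], ["B_key", "val"])

# groupBy alone yields GroupedData, which offers no join; aggregating
# with collect_list produces a plain DataFrame that can be joined.
A_grouped = A.groupBy("A_key").agg(F.collect_list("val").alias("a_vals"))
B_grouped = B.groupBy("B_key").agg(F.collect_list("val").alias("b_vals"))

joined = A_grouped.join(B_grouped, A_grouped.A_key == B_grouped.B_key, "outer")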

1 Answer:

Answer 0 (score: 1):

From the pyspark API documentation, http://spark.apache.org/docs/1.6.1/api/python/pyspark.html:

cogroup(other, numPartitions=None)
For each key k in self or other, return a resulting RDD that contains a tuple with the list of values for that key in self as well as other.
>>> x = sc.parallelize([("a", 1), ("b", 4)])
>>> y = sc.parallelize([("a", 2)])
>>> [(x, tuple(map(list, y))) for x, y in sorted(list(x.cogroup(y).collect()))]
[('a', ([1], [2])), ('b', ([4], []))]
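Applied to the question's setup, a minimal sketch (assuming A and B are the DataFrames and that A_key and B_key are their key column names) drops down to the underlying RDDs, keys each one by its own column, and cogroups:

# Key each DataFrame's underlying RDD by its own key column
# (A_key / B_key are the column names assumed from the question).
a_pairs = A.rdd.map(lambda row: (row["A_key"], row))
b_pairs = B.rdd.map(lambda row: (row["B_key"], row))

# cogroup yields, per key, a pair of iterables holding the rows from
# each side; materialize them as lists for inspection.
cogrouped = a_pairs.cogroup(b_pairs).mapValues(
    lambda grouped: (list(grouped[0]), list(grouped[1]))
)
cogrouped.collect()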