PySpark cogroup on multiple keys

Date: 2016-06-03 12:35:34

Tags: apache-spark pyspark

Given two lists, I want to group them based on the co-occurrence of the first two keys:

x=[(1,(2,'cat')),(4,(5,'dog'))]
y=[(1,(2,'hairBall')),(4,(5,'woof'))]

Desired output:

z=[(1,2,('cat','hairBall')),(4,5,('dog','woof'))]

What I have tried so far:

sc=SparkContext()
xs=sc.parallelize(x)
ys=sc.parallelize(y)

zs_temp=xs.cogroup(ys)

This results in:

zs_temp.collect()=[(1, [[(2, 'cat')], [(2, 'hairBall')]]), (4, [[(5, 'dog')], [(5, 'woof')]])]
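
(Depending on the PySpark version, the grouped values are ResultIterable objects rather than plain lists; a minimal sketch, assuming the RDDs above, that materializes them for readable printing:)

# Turn each pair of ResultIterables into a pair of lists for display
zs_temp.mapValues(lambda v: (list(v[0]), list(v[1]))).collect()
# [(1, ([(2, 'cat')], [(2, 'hairBall')])), (4, ([(5, 'dog')], [(5, 'woof')]))]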

Attempted solution:

zs_temp.map(lambda f: f[1].cogroup(f[1]) ).collect()

But I get the error:

AttributeError: 'tuple' object has no attribute 'cogroup'

1 Answer:

Answer 0 (score: 2):

Test data:

x=[(1,(2,'cat')),(4,(5,'dog'))]
y=[(1,(2,'hairBall')),(4,(5,'woof'))]
xs=sc.parallelize(x)
ys=sc.parallelize(y)

A function to re-key the records:

def reKey(r):
    # (k1, (k2, v)) -> ((k1, k2), v): move the second key into the key tuple
    return ((r[0], r[1][0]), r[1][1])

Re-key both RDDs:

xs2 = xs.map(reKey)
ys2 = ys.map(reKey)
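
For reference, with the test data above, the re-keyed RDDs are keyed on both original keys:

xs2.collect()  # [((1, 2), 'cat'), ((4, 5), 'dog')]
ys2.collect()  # [((1, 2), 'hairBall'), ((4, 5), 'woof')]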

Join the data and collect the results:

results = ys2.join(xs2)
results.collect()
  

[((1, 2), ('hairBall', 'cat')), ((4, 5), ('woof', 'dog'))]
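
To get the flat (k1, k2, (v1, v2)) layout asked for in the question, one more map is enough; a minimal sketch (the value order follows the join, so use xs2.join(ys2) instead if ('cat', 'hairBall') ordering is wanted):

# Flatten the composite key back into two separate fields
results.map(lambda r: (r[0][0], r[0][1], r[1])).collect()
# [(1, 2, ('hairBall', 'cat')), (4, 5, ('woof', 'dog'))]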