Given two lists, I want to group them by the co-occurrence of their first two keys:
x=[(1,(2,'cat')),(4,(5,'dog'))]
y=[(1,(2,'hairBall')),(4,(5,'woof'))]
Desired output:
z=[(1,2,('cat','hairBall')),(4,5,('dog','woof'))]
What I have tried so far:
from pyspark import SparkContext
sc=SparkContext()
xs=sc.parallelize(x)
ys=sc.parallelize(y)
zs_temp=xs.cogroup(ys)
This results in:
zs_temp.collect()=[(1, [[(2, 'cat')], [(2, 'hairBall')]]), (4, [[(5, 'dog')], [(5, 'woof')]])]
Attempted solution:
zs_temp.map(lambda f: f[1].cogroup(f[1]) ).collect()
But I get the error:
AttributeError: 'tuple' object has no attribute 'cogroup'
Answer 0 (score: 2)
Test data:
x=[(1,(2,'cat')),(4,(5,'dog'))]
y=[(1,(2,'hairBall')),(4,(5,'woof'))]
xs=sc.parallelize(x)
ys=sc.parallelize(y)
A function to change the key, moving the second key from the value into a composite key:
def reKey(r):
    return ((r[0], r[1][0]), r[1][1])
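Before running this on an RDD, the re-keying can be checked on the plain Python list. This is only a local sketch that needs no Spark; `re_key` is a renamed, unpacking-based copy of the function above (the name is chosen here for illustration), and the sample data is the `x` from the question.

```python
def re_key(record):
    # A record looks like (k1, (k2, value)); move k2 into a composite key.
    k1, (k2, value) = record
    return ((k1, k2), value)

x = [(1, (2, 'cat')), (4, (5, 'dog'))]
rekeyed = [re_key(r) for r in x]
print(rekeyed)  # [((1, 2), 'cat'), ((4, 5), 'dog')]
```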
Re-key both RDDs:
xs2 = xs.map(reKey)
ys2 = ys.map(reKey)
Join the data and collect the results:
results = ys2.join(xs2)
results.collect()
[((1, 2), ('hairBall', 'cat')), ((4, 5), ('woof', 'dog'))]
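Note that this result keeps the composite key as a nested tuple and lists the values in `(y, x)` order, while the question asks for flat `(k1, k2, (x_value, y_value))` triples. A minimal sketch of the remaining step follows, emulated with plain Python so it runs without Spark. The helper names `re_key`, `local_join` and `flatten` are made up for illustration and are not part of PySpark. On the real RDDs the equivalent would presumably be `xs2.join(ys2).map(flatten)`, joining from `xs2` so the `x` value comes first.

```python
def re_key(record):
    # (k1, (k2, value)) -> ((k1, k2), value)
    k1, (k2, value) = record
    return ((k1, k2), value)

def local_join(left, right):
    # Inner join of two lists of (key, value) pairs, yielding
    # (key, (left_value, right_value)) like an RDD join would.
    right_by_key = {}
    for key, value in right:
        right_by_key.setdefault(key, []).append(value)
    return [(key, (lv, rv))
            for key, lv in left
            for rv in right_by_key.get(key, [])]

def flatten(pair):
    # ((k1, k2), values) -> (k1, k2, values)
    (k1, k2), values = pair
    return (k1, k2, values)

x = [(1, (2, 'cat')), (4, (5, 'dog'))]
y = [(1, (2, 'hairBall')), (4, (5, 'woof'))]

joined = local_join([re_key(r) for r in x], [re_key(r) for r in y])
z = [flatten(p) for p in joined]
print(z)  # [(1, 2, ('cat', 'hairBall')), (4, 5, ('dog', 'woof'))]
```

Keep in mind that `collect()` on a joined RDD does not guarantee any particular ordering of the records, so sort the result if the order matters.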