PySpark cogroup on multiple keys

Date: 2016-06-03 12:35:34

Tags: apache-spark pyspark

Given two lists, I want to group them based on the co-occurrence of the first two keys:

x=[(1,(2,'cat')),(4,(5,'dog'))]
y=[(1,(2,'hairBall')),(4,(5,'woof'))]

Desired output:

z=[(1,2,('cat','hairBall')),(4,5,('dog','woof'))]

What I have tried so far:

sc=SparkContext()
xs=sc.parallelize(x)
ys=sc.parallelize(y)

zs_temp=xs.cogroup(ys)

This results in:

zs_temp.collect()=[(1, [[(2, 'cat')], [(2, 'hairBall')]]), (4, [[(5, 'dog')], [(5, 'woof')]])]
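
(Depending on the PySpark version, the grouped values are ResultIterable objects rather than plain lists; a minimal sketch, assuming the RDDs above, that materializes them for readable printing:)

# Turn each pair of ResultIterables into a pair of lists for display
zs_temp.mapValues(lambda v: (list(v[0]), list(v[1]))).collect()
# [(1, ([(2, 'cat')], [(2, 'hairBall')])), (4, ([(5, 'dog')], [(5, 'woof')]))]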

Attempted solution:

zs_temp.map(lambda f: f[1].cogroup(f[1]) ).collect()

But I get the error:

AttributeError: 'tuple' object has no attribute 'cogroup'

1 Answer:

Answer 0 (score: 2):

Test data:

x=[(1,(2,'cat')),(4,(5,'dog'))]
y=[(1,(2,'hairBall')),(4,(5,'woof'))]
xs=sc.parallelize(x)
ys=sc.parallelize(y)

A function to re-key the records:

def reKey(r):
    # (k1, (k2, v)) -> ((k1, k2), v): move the second key into the key tuple
    return ((r[0], r[1][0]), r[1][1])

Re-key both RDDs:

xs2 = xs.map(reKey)
ys2 = ys.map(reKey)
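
For reference, with the test data above, the re-keyed RDDs are keyed on both original keys:

xs2.collect()  # [((1, 2), 'cat'), ((4, 5), 'dog')]
ys2.collect()  # [((1, 2), 'hairBall'), ((4, 5), 'woof')]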

Join the data and collect the results:

results = ys2.join(xs2)
results.collect()
  

[((1, 2), ('hairBall', 'cat')), ((4, 5), ('woof', 'dog'))]
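
To get the flat (k1, k2, (v1, v2)) layout asked for in the question, one more map is enough; a minimal sketch (the value order follows the join, so use xs2.join(ys2) instead if ('cat', 'hairBall') ordering is wanted):

# Flatten the composite key back into two separate fields
results.map(lambda r: (r[0][0], r[0][1], r[1])).collect()
# [(1, 2, ('hairBall', 'cat')), (4, 5, ('woof', 'dog'))]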