Question

I have a list of (key, value) pairs of the form:

x=[(('cat','dog'),('a','b')),(('cat','dog'),('a','b')),(('mouse','rat'),('e','f'))]

I want to count the number of times each value tuple appears with the key tuple.

Desired output:

[(('cat','dog'),('a','b',2)),(('mouse','rat'),('e','f',1))]

A working solution is:

from collections import Counter
xs=sc.parallelize(x)
xs=xs.groupByKey()
xs=xs.map(lambda kv: (kv[0], Counter(kv[1])))

However, for large datasets this method fills up the disk space (~600GB). I tried to implement a similar solution using reduceByKey:

xs=xs.reduceByKey(Counter).collect()

but I get an error.

Answer 1

Here is how I usually do it:

xs=sc.parallelize(x)
a = xs.map(lambda x: (x, 1)).reduceByKey(lambda a,b: a+b)

a.collect() yields:

[((('mouse', 'rat'), ('e', 'f')), 1), ((('cat', 'dog'), ('a', 'b')), 2)]
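
For intuition, the same counts can be reproduced locally without Spark. This is only a sketch of what the map and reduceByKey steps compute on the sample list, using collections.Counter from the standard library; the name local_result is made up for illustration, and this is not meant for large data.

```python
from collections import Counter

x = [(('cat', 'dog'), ('a', 'b')),
     (('cat', 'dog'), ('a', 'b')),
     (('mouse', 'rat'), ('e', 'f'))]

# Counting whole (key, value) pairs mirrors
# map(lambda x: (x, 1)) followed by reduceByKey(lambda a, b: a + b).
pair_counts = Counter(x)

# Same shape as a.collect(): [((key, value), count), ...].
# Sorted only so the order is stable; Spark does not guarantee an order.
local_result = sorted(pair_counts.items())
```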

I'm going to assume you want the count (here 1 and 2) placed inside the second element of each (key1, key2) pair.

To achieve that, try the following:

a.map(lambda x: (x[0][0], x[0][1] + (x[1],))).collect()

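The last step basically remaps each record: it takes the first key tuple (such as ('mouse','rat')), then takes the second key tuple (such as ('e','f')) and appends a one-element tuple built from x[1], which is the count for that pair.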
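
Putting the two steps together, a minimal sketch might look like the following. It assumes an already-created SparkContext is passed in as sc; the helper names attach_count and count_pairs are made up for illustration and are not part of PySpark. Because reduceByKey combines values within each partition before the shuffle, it should avoid materializing every value per key the way groupByKey does.

```python
from operator import add


def attach_count(record):
    """Turn ((key, value), count) into (key, value + (count,))."""
    (key, value), count = record
    return key, value + (count,)


def count_pairs(sc, pairs):
    """Count each (key, value) pair, then fold the count into the value tuple.

    `sc` is assumed to be an existing SparkContext (hypothetical parameter).
    """
    return (
        sc.parallelize(pairs)
        .map(lambda kv: (kv, 1))   # each (key, value) pair becomes its own key
        .reduceByKey(add)          # sum the ones per distinct pair
        .map(attach_count)         # reshape to (key, value + (count,))
        .collect()
    )
```

Calling count_pairs(sc, x) on the sample list should give the two records from the desired output, though Spark does not guarantee their order.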

PySpark: counting occurrences of values by key
