I have a pandas DataFrame that looks like this:
Client 1_act 2_act 3_act 4_act 5_act 6_act ...
1 hiking swimming skating jumping climbing eating
2 eating hiking climbing exploring
3 hiking exercising
4 hiking screaming yelling hopping swimming
...
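Each row contains only unique activities, and there can be any number of columns named #_act (new columns may be added at any time when a client reports a new activity). Every row has at least one pair, that is, at least 2 activities. New activity values can also appear at any time.
I am trying to find a way to return the most common pairs of activities. The desired output would be: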
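For anyone who wants to reproduce this, here is a minimal sketch of how the sample frame could be built. It assumes that empty cells are stored as missing values (None/NaN) rather than empty strings. The `rows` dict and the sanity checks are illustrative and not part of the original data pipeline.

```python
import pandas as pd

# Hypothetical reconstruction of the sample rows from the table above.
rows = {
    1: ['hiking', 'swimming', 'skating', 'jumping', 'climbing', 'eating'],
    2: ['eating', 'hiking', 'climbing', 'exploring'],
    3: ['hiking', 'exercising'],
    4: ['hiking', 'screaming', 'yelling', 'hopping', 'swimming'],
}

# Ragged lists are padded with missing values to the longest row.
df = pd.DataFrame(list(rows.values()),
                  index=pd.Index(list(rows), name='Client'))
df.columns = [f'{i + 1}_act' for i in range(df.shape[1])]
df = df.reset_index()

# Check the two stated invariants: at least 2 activities per row,
# and no activity repeated within a row.
acts = df.drop(columns='Client')
n_acts = acts.notna().sum(axis=1)
assert (n_acts >= 2).all()
assert (acts.nunique(axis=1) == n_acts).all()
```

A row that gains a seventh activity would simply widen the frame by one more `#_act` column, with missing values for every other client.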
Pair Qty
(hiking, swimming) 2
(hiking, skating) 1
(hiking, jumping) 1
(hiking, climbing) 2
(hiking, eating) 2
(swimming, skating) 1
(swimming, jumping) 1
(swimming, climbing) 1
(swimming, eating) 1
(skating, jumping) 1
(skating, climbing) 1
(skating, eating) 1
(jumping, climbing) 1
(climbing, eating) 2
(eating, exploring) 1
(hiking, exercising) 1
(hiking, screaming) 1
(hiking, yelling) 1
(hiking, hopping) 1
...
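The output above is sample output listing all possible pairs across all rows of this example dataset. If a pair appears again in a later row, its quantity should be incremented; if a new pair appears in a later row, it should be added as a new row in the Pair column.
The goal is to see which pairs of activities are most common across all clients. Any help would be greatly appreciated. Thanks!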
Answer 0 (score: 1)
Use combinations in a list comprehension to flatten the rows, count the tuples with Counter, and pass the result to the DataFrame constructor:
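from collections import Counter
from itertools import combinations
import pandas as pd

# Move Client into the index so df.values holds only the activity columns
df = df.set_index('Client')
# Count every 2-combination of activities within each row
c = Counter([y for x in df.values for y in combinations(x, 2)])
# Store the result under a new name so df stays available for the next step
pairs = pd.DataFrame({'Pair': list(c.keys()), 'Qty': list(c.values())})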
For the top combinations:
n = 10
# most_common(n) returns the n most frequent (pair, count) tuples
L = Counter([y for x in df.values for y in combinations(x, 2)]).most_common(n)
top = pd.DataFrame(L, columns=['Pair', 'Qty'])
print(top)
Pair Qty
0 (hiking, swimming) 2
1 (hiking, climbing) 2
2 (hiking, eating) 2
3 (swimming, eating) 2
4 (hiking, hopping) 2
5 (hiking, skating) 1
6 (hiking, jumping) 1
7 (swimming, skating) 1
8 (swimming, jumping) 1
9 (swimming, climbing) 1
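One caveat with counting straight from df.values: rows with fewer activities contain empty cells, and combinations is order-sensitive, so (eating, hiking) and (hiking, eating) would be counted separately. The sketch below is one possible way to handle both, assuming empty cells are NaN and that pairs should be treated as unordered (which is what the question's expected output suggests). The helper name `pair_counts` and the sample frame are illustrative only.

```python
from collections import Counter
from itertools import combinations

import numpy as np
import pandas as pd

# Hypothetical sample frame mirroring the question; empty cells are NaN.
df = pd.DataFrame({
    'Client': [1, 2, 3, 4],
    '1_act': ['hiking', 'eating', 'hiking', 'hiking'],
    '2_act': ['swimming', 'hiking', 'exercising', 'screaming'],
    '3_act': ['skating', 'climbing', np.nan, 'yelling'],
    '4_act': ['jumping', 'exploring', np.nan, 'hopping'],
    '5_act': ['climbing', np.nan, np.nan, 'swimming'],
    '6_act': ['eating', np.nan, np.nan, np.nan],
}).set_index('Client')


def pair_counts(frame):
    """Count unordered activity pairs per row, skipping missing cells."""
    counts = Counter()
    for row in frame.to_numpy():
        # Drop NaN/None, then sort so each pair has one canonical order.
        acts = sorted(a for a in row if pd.notna(a))
        counts.update(combinations(acts, 2))
    return counts


counts = pair_counts(df)
top = pd.DataFrame(counts.most_common(5), columns=['Pair', 'Qty'])
print(top)
```

Because each row is sorted first, every pair is reported in alphabetical order, for example ('eating', 'hiking') rather than ('hiking', 'eating'). If empty cells are stored as empty strings instead of NaN, the filter inside `pair_counts` would need to change accordingly.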