My dummy DataFrame looks like this:
+--------+------+------+------+------+
| item | p1 | p2 | p3 | p4 |
|--------+------+------+------+------|
| a | 1 | 0 | 1 | 1 |
| b | 0 | 1 | 1 | 0 |
| c | 1 | 0 | 1 | 1 |
| d | 0 | 0 | 0 | 1 |
| e | 1 | 0 | 1 | 1 |
| f | 1 | 1 | 1 | 1 |
| g | 1 | 0 | 0 | 0 |
+--------+------+------+------+------+
I want to find out how the parameters p1, p2, p3, p4 are used, either in combination or on their own. The expected result is:
+--------+---------------+--------+---------+
| Length | P-group(s)    | Count  | Items   |
+--------+---------------+--------+---------+
| 1      | p1            | 1      | g       |
|        | p4            | 1      | d       |
|        |               |        |         |
| 2      | p2,p3         | 1      | b       |
|        |               |        |         |
| 3      | p1,p3,p4      | 3      | [a,c,e] |
|        |               |        |         |
| 4      | p1,p2,p3,p4   | 1      | f       |
+--------+---------------+--------+---------+
My original code is as follows:
import pandas as pd
from itertools import chain, combinations

df = pd.DataFrame({'item': ['a','b','c','d','e','f','g'],
                   'p1': [1,0,1,0,1,1,1],
                   'p2': [0,1,0,0,0,1,0],
                   'p3': [1,1,1,0,1,1,0],
                   'p4': [1,0,1,1,1,1,0]})

def all_subsets(ss):
    # Every combination of ss, from the empty set up to the full set.
    return chain(*map(lambda x: combinations(ss, x), range(0, len(ss)+1)))

subsets = []
for subset in all_subsets(list(df)[1:]):
    subsets.append(list(subset))

for grp in subsets[1:]:  # subsets[1:] excludes the empty set
    print(df.groupby(grp).size().reset_index().rename(columns={0: 'count'}))
Is there a pandas method that produces the expected result?
Answer 0: (score: 2)
Use DataFrame.groupby together with DataFrame.filter:
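import pandas as pd

# Keep only the parameter columns (names containing 'p').
tmp = df.filter(like='p')
# Replace each 1 with the name of its own column; 0 stays 0.
new = tmp.replace(1, pd.Series(tmp.columns, tmp.columns))
# Number of parameters used by each item.
df['length'] = tmp.sum(axis=1)
# Comma-joined names of the parameters each item uses.
df['groups'] = new.apply(lambda x: ','.join(s for s in x if s), axis=1)
gdf = df.groupby(['length', 'groups'])['item'].agg(['count', list])
print(gdf)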
Output:
                    count       list
length groups
1      p1               1        [g]
       p4               1        [d]
2      p2,p3            1        [b]
3      p1,p3,p4         3  [a, c, e]
4      p1,p2,p3,p4      1        [f]
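To make the replace step above easier to follow, here is a minimal sketch on a made-up two-row frame (`demo`, `names`, `labelled` and `labels` are hypothetical names used only for illustration). It assumes that passing a Series indexed by column name as the replacement value makes pandas replace column by column, which is the behavior the code above relies on.

```python
import pandas as pd

# Hypothetical two-row frame, used only to illustrate the step.
demo = pd.DataFrame({'p1': [1, 0], 'p2': [1, 1]})

# Series mapping each column name to itself: p1 -> 'p1', p2 -> 'p2'.
names = pd.Series(demo.columns, index=demo.columns)

# Column-wise replacement: every 1 becomes the name of its column, 0 stays 0.
labelled = demo.replace(1, names)

# Joining the truthy entries of each row yields that row's group label.
labels = labelled.apply(lambda row: ','.join(s for s in row if s), axis=1)
print(labels.tolist())
```

The zeros survive the replacement, which is why the join filters on truthiness before concatenating.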
If you want to unwrap the single-element lists in gdf['list'], add the following line:
gdf['list'] = [l[0] if len(l)==1 else l for l in gdf['list']]
The result then matches the desired output:
                    count       list
length groups
1      p1               1          g
       p4               1          d
2      p2,p3            1          b
3      p1,p3,p4         3  [a, c, e]
4      p1,p2,p3,p4      1          f
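For comparison, a self-contained variant of the same idea might look like the sketch below. It is not the code from the answer above: it assumes the parameter columns are listed explicitly in a hypothetical `params` list instead of being selected with filter(like='p'), builds the labels with a plain Python loop, and uses named aggregation so the result columns are called `count` and `items`.

```python
import pandas as pd

df = pd.DataFrame({'item': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
                   'p1': [1, 0, 1, 0, 1, 1, 1],
                   'p2': [0, 1, 0, 0, 0, 1, 0],
                   'p3': [1, 1, 1, 0, 1, 1, 0],
                   'p4': [1, 0, 1, 1, 1, 1, 0]})

# Assumption: parameter columns are named explicitly.
params = ['p1', 'p2', 'p3', 'p4']

# One label per row: the comma-joined names of the columns set to 1.
groups = [','.join(p for p, on in zip(params, row) if on)
          for row in df[params].to_numpy()]

# Group by (number of parameters used, label); count and collect the items.
summary = (df.assign(length=df[params].sum(axis=1), groups=groups)
             .groupby(['length', 'groups'])['item']
             .agg(count='count', items=list))
print(summary)
```

Listing the columns explicitly avoids accidentally matching other columns whose names happen to contain 'p' (for example a `groups` column added on an earlier run).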