I have this DataFrame:
source target
0 ape dog
1 ape hous
2 dog hous
3 hors dog
4 hors ape
5 dog ape
6 ape bird
7 ape hous
8 bird hous
9 bird fist
10 bird ape
11 fist ape
I am trying to generate frequency counts with the following code:
df_count = df.groupby(['source', 'target']).size().reset_index().sort_values(0, ascending=False)
df_count.columns = ['source', 'target', 'weight']
I get the following result:
source target weight
2 ape hous 2
0 ape bird 1
1 ape dog 1
3 bird ape 1
4 bird fist 1
5 bird hous 1
6 dog ape 1
7 dog hous 1
8 fist ape 1
9 hors ape 1
10 hors dog 1
How can I modify the code so that direction does not matter, i.e. instead of ape bird 1 and bird ape 1, I get ape bird 2?
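Answer 0 (score: 5)
First, sort the values within each row, so that every pair is stored in the same order regardless of direction.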
In [31]: df
Out[31]:
source target
0 ape dog
1 ape hous
2 dog hous
3 hors dog
4 hors ape
5 dog ape
6 ape bird
7 ape hous
8 bird hous
9 bird fist
10 bird ape
11 fist ape
In [32]: df.values.sort()
In [33]: df
Out[33]:
source target
0 ape dog
1 ape hous
2 dog hous
3 dog hors
4 ape hors
5 ape dog
6 ape bird
7 ape hous
8 bird hous
9 bird fist
10 ape bird
11 ape fist
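The in-place `df.values.sort()` above only works if `df.values` exposes the frame's underlying array, which may not hold in every pandas version or for frames with mixed dtypes. A minimal sketch of the same row-wise sort that builds a new frame instead, assuming NumPy is available (the names `sorted_pairs` and `df_sorted` are illustrative, not from the original answer):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "source": ["ape", "ape", "dog", "hors", "hors", "dog",
               "ape", "ape", "bird", "bird", "bird", "fist"],
    "target": ["dog", "hous", "hous", "dog", "ape", "ape",
               "bird", "hous", "hous", "fist", "ape", "ape"],
})

# np.sort returns a sorted copy, so the original df is left unchanged.
# axis=1 sorts the two labels within each row alphabetically.
sorted_pairs = np.sort(df[["source", "target"]].to_numpy(dtype=object), axis=1)

# Rebuild a frame with the same index and column names.
df_sorted = pd.DataFrame(sorted_pairs, columns=["source", "target"], index=df.index)
print(df_sorted)
```

After this, `df_sorted` can be used in place of the mutated `df` in the grouping step.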
Then group by source and target, aggregate with size, and sort the result.
In [34]: (df.groupby(['source', 'target']).size().sort_values(ascending=False)
    ...:    .reset_index(name='weight'))
Out[34]:
source target weight
0 ape hous 2
1 ape dog 2
2 ape bird 2
3 dog hous 1
4 dog hors 1
5 bird hous 1
6 bird fist 1
7 ape hors 1
8 ape fist 1
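Rows with equal weight can come out in any order, because `sort_values` does not use a stable sort by default. The following is a sketch, not the answerer's code, that wraps both steps in one function and breaks ties alphabetically so the output is reproducible. The helper name `undirected_edge_counts` is hypothetical, and the swap via `Series.mask` is just one possible way to normalize the pairs without mutating the input:

```python
import pandas as pd

def undirected_edge_counts(edges: pd.DataFrame) -> pd.DataFrame:
    """Count (source, target) pairs, treating a-b and b-a as the same pair."""
    # True where the pair is stored in "reverse" alphabetical order.
    swap = edges["source"] > edges["target"]
    # Swap the two labels on those rows so the smaller label is always first.
    normalized = pd.DataFrame({
        "source": edges["source"].mask(swap, edges["target"]),
        "target": edges["target"].mask(swap, edges["source"]),
    })
    return (
        normalized.groupby(["source", "target"])
        .size()
        .reset_index(name="weight")
        # Highest weight first; ties broken alphabetically for a stable order.
        .sort_values(["weight", "source", "target"], ascending=[False, True, True])
        .reset_index(drop=True)
    )

df = pd.DataFrame({
    "source": ["ape", "ape", "dog", "hors", "hors", "dog",
               "ape", "ape", "bird", "bird", "bird", "fist"],
    "target": ["dog", "hous", "hous", "dog", "ape", "ape",
               "bird", "hous", "hous", "fist", "ape", "ape"],
})
df_count = undirected_edge_counts(df)
print(df_count)
```

The input frame is not modified, so the directed counts from the question can still be computed from the same `df` afterwards.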
Answer 1 (score: 4)
You can first sort each row with apply, then pass the name parameter to reset_index:
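# result_type='broadcast' keeps the original columns; without it, newer pandas
# versions (0.23+) return a Series of lists and the groupby below fails
df_count = df.apply(sorted, axis=1, result_type='broadcast') \
             .groupby(['source', 'target']) \
             .size() \
             .reset_index(name='weight') \
             .sort_values('weight', ascending=False)
print(df_count)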
source target weight
0 ape bird 2
1 ape dog 2
4 ape hous 2
2 ape fist 1
3 ape hors 1
5 bird fist 1
6 bird hous 1
7 dog hors 1
8 dog hous 1