Question

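I have paired categorical data, but I don't want to count a combination such as "toy" and "B" more than once when it occurs several times. I can build a pivot table of counts, but what I want is effectively a 1 or a 0 depending on whether a given combination of the two values occurs at all, rather than the number of matches (2, 3, 4, and so on).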

Here is a sample of the input:

RS232,1.8,focused,C
RS233,2.8,chew,E
RS234,3.8,toy,D
RS235,4.8,poodle,C
RS236,5.8,winding,E
RS237,6.8,up,D
RS238,7.8,focused,B
RS239,9.8,chew,B
RS240,7.8,toy,B
RS241,6.8,toy,B
RS242,5.8,toy,A
RS243,4.8,focused,A
RS244,9.8,chew,A
RS245,8.8,chew,A
RS246,7.8,chew,C
RS247,6.8,winding,C
RS248,5.8,winding,C
RS249,4.8,winding,D
RS250,3.8,toy,D

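The numeric field doesn't matter beyond an earlier filtering step. But I want RS244 and RS245 to contribute only a single count to the bar chart, because seeing that combination twice just means people tried it a lot; the repetition itself carries no special meaning.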
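To make the "1 or 0" idea concrete, here is a minimal sketch using rows RS239 to RS246 from the sample. The column names attrib2 and group are assumed for the third and fourth fields, and pd.crosstab followed by clip is just one possible way to get a presence indicator, not necessarily the best one.

```python
import pandas as pd

# Rows RS239..RS246 from the sample; the column names are an assumption.
df = pd.DataFrame({
    "attrib2": ["chew", "toy", "toy", "toy", "focused", "chew", "chew", "chew"],
    "group":   ["B",    "B",   "B",   "A",   "A",       "A",    "A",    "C"],
})

# crosstab counts every occurrence of each (attrib2, group) pair;
# clip(upper=1) caps each cell at 1, so the table becomes present/absent.
presence = pd.crosstab(df["attrib2"], df["group"]).clip(upper=1)
print(presence)

# Summing across each row gives the number of distinct groups per attrib2 value.
unique_pairs = presence.sum(axis=1)
print(unique_pairs)
```

Here chew/A occurs twice (RS244 and RS245) but contributes only 1 to the row sum for chew.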

I end up with this data, which I then plot:

    attrib2 group  count
0      chew     A      2
1      chew     B      1
2      chew     C      1
3      chew     E      1
4   focused     A      1
5   focused     B      1
6   focused     C      1
7    poodle     C      1
8       toy     A      1
9       toy     B      2
10      toy     D      2
11       up     D      1
12  winding     C      2
13  winding     D      1
14  winding     E      1

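Note that duplicated pairs have a count > 1. For plotting, though, I use .value_counts, so I ignore the count field and simply plot the number of UNIQUE items paired with each element of attrib2. The histogram I want is just the number of times each element is listed in the attrib2 column above.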

My crude way of doing this is below. Surely there must be a cleaner, more elegant way to achieve this?

import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import interactive

df = pd.read_csv('out.txt', sep=',', engine='c', lineterminator='\n', header='infer')

# I am getting group/attrib2 pairs, but I want my plot to be against attrib2
groupout3 = df.groupby(['attrib2']).group.value_counts().sort_index()

# groupby gives multiple counts for same combination, so set to 1 or leave as 0
# following line not needed since I use value_counts below, so it counts 1 if
# there is something there, regardless of the value, so 1, 2, etc. all get
# counted as 1 and 0 is 0
# groupout3[groupout3 != 0] = 1

# convert back to DataFrame for plotting
dfgroup = groupout3.to_frame('count')

# make index back to column name
dfgroup.reset_index(level=['group', 'attrib2'], inplace=True)

# plot categorical data counting
plt.figure()
dfgroup.attrib2.value_counts().plot(kind='bar')
plt.show()

Surely there is a more elegant way to do this? Thanks!

Answer 1

IIUC, you can do it this way:

(df.groupby(['attrib2','group'])
   .size()
   .reset_index()
   .groupby('attrib2')
   .size()
   .plot.bar(rot=0)
)

Data:

In [85]: df
Out[85]:
   attrib  num  attrib2 group
0   RS232  1.8  focused     C
1   RS233  2.8     chew     E
2   RS234  3.8      toy     D
3   RS235  4.8   poodle     C
4   RS236  5.8  winding     E
5   RS237  6.8       up     D
6   RS238  7.8  focused     B
7   RS239  9.8     chew     B
8   RS240  7.8      toy     B
9   RS241  6.8      toy     B
10  RS242  5.8      toy     A
11  RS243  4.8  focused     A
12  RS244  9.8     chew     A
13  RS245  8.8     chew     A
14  RS246  7.8     chew     C
15  RS247  6.8  winding     C
16  RS248  5.8  winding     C
17  RS249  4.8  winding     D
18  RS250  3.8      toy     D
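
For comparison, here is a sketch of two other ways to get the same per-attrib2 counts of unique pairs, checked against the two-stage size() chain on the sample data. The header names in the inline CSV are assumed to match the frame shown above. groupby(...).nunique() and drop_duplicates(...) are standard pandas calls, but which one reads best is a matter of taste.

```python
import io
import pandas as pd

# Sample rows from the question; the header line is an assumption
# chosen to match the DataFrame printed above.
raw = """attrib,num,attrib2,group
RS232,1.8,focused,C
RS233,2.8,chew,E
RS234,3.8,toy,D
RS235,4.8,poodle,C
RS236,5.8,winding,E
RS237,6.8,up,D
RS238,7.8,focused,B
RS239,9.8,chew,B
RS240,7.8,toy,B
RS241,6.8,toy,B
RS242,5.8,toy,A
RS243,4.8,focused,A
RS244,9.8,chew,A
RS245,8.8,chew,A
RS246,7.8,chew,C
RS247,6.8,winding,C
RS248,5.8,winding,C
RS249,4.8,winding,D
RS250,3.8,toy,D
"""
df = pd.read_csv(io.StringIO(raw))

# Two-stage size(): one row per unique (attrib2, group) pair,
# then count how many such rows each attrib2 value has.
two_stage = (df.groupby(["attrib2", "group"])
               .size()
               .reset_index()
               .groupby("attrib2")
               .size())

# Alternative 1: count the distinct groups seen for each attrib2 value.
by_nunique = df.groupby("attrib2")["group"].nunique()

# Alternative 2: drop repeated pairs first, then count attrib2 values.
by_dedup = (df.drop_duplicates(subset=["attrib2", "group"])["attrib2"]
              .value_counts()
              .sort_index())

print(by_nunique)
# by_nunique.plot.bar(rot=0) would draw the same bar chart as the chain above.
```

All three give chew 4, focused 3, poodle 1, toy 3, up 1, winding 3, which matches the number of rows per attrib2 value in the question's table.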
