Is there a better way than this to make a categorical pandas column?

Time: 2016-06-16 17:06:08

Tags: python pandas

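I have paired categorical data, but I don't want to count instances where, for example, "toy" and "B" occur together more than once. I can build a pivot table of counts, but what I want is the equivalent of 1 or 0, depending on whether a given combination of the two values occurs at all, rather than the number of occurrences (2, 3, 4, and so on).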

Here is a sample of the input:

RS232,1.8,focused,C
RS233,2.8,chew,E
RS234,3.8,toy,D
RS235,4.8,poodle,C
RS236,5.8,winding,E
RS237,6.8,up,D
RS238,7.8,focused,B
RS239,9.8,chew,B
RS240,7.8,toy,B
RS241,6.8,toy,B
RS242,5.8,toy,A
RS243,4.8,focused,A
RS244,9.8,chew,A
RS245,8.8,chew,A
RS246,7.8,chew,C
RS247,6.8,winding,C
RS248,5.8,winding,C
RS249,4.8,winding,D
RS250,3.8,toy,D

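Apart from an earlier filtering step, the numeric field doesn't matter. But I want RS244 and RS245 to count only once in the bar chart, because having this combination twice just means that people tried it a lot, not that the repeated occurrence has any special meaning.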

I end up with this data, which I plot:

    attrib2 group  count
0      chew     A      2
1      chew     B      1
2      chew     C      1
3      chew     E      1
4   focused     A      1
5   focused     B      1
6   focused     C      1
7    poodle     C      1
8       toy     A      1
9       toy     B      2
10      toy     D      2
11       up     D      1
12  winding     C      2
13  winding     D      1
14  winding     E      1

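Note that repeated pairs have a count > 1, but for the plot I use .value_counts, so I ignore the count field and simply plot the number of UNIQUE items paired with each element of attrib2. The histogram I want is just the number of times each element is listed in the attrib2 column above.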
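To make the 1-or-0 idea concrete, here is a minimal sketch using only a few rows of the sample input. It assumes the columns are named attrib, num, attrib2 and group; the variable names indicator and unique_pairs are made up for illustration.

```python
import io
import pandas as pd

# A few rows from the sample input; the header line is an assumption.
raw = """attrib,num,attrib2,group
RS239,9.8,chew,B
RS240,7.8,toy,B
RS241,6.8,toy,B
RS242,5.8,toy,A
RS244,9.8,chew,A
RS245,8.8,chew,A
"""
df = pd.read_csv(io.StringIO(raw))

# crosstab counts each (attrib2, group) pair; clip caps every count at 1,
# so repeated pairs such as (toy, B) collapse to a plain 0/1 indicator.
indicator = pd.crosstab(df["attrib2"], df["group"]).clip(upper=1)

# Row sums give the number of distinct groups paired with each attrib2 value.
unique_pairs = indicator.sum(axis=1)
```

Summing the rows of the indicator table should give the bar heights described above, since each pair contributes at most 1.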


The crude way I'm doing this is below. Surely there must be a cleaner, more idiomatic way to achieve this?

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('out.txt', sep=',', engine='c', lineterminator='\n', header='infer')

# I am getting group/attrib2 pairs, but I want my plot to be against attrib2
groupout3 = df.groupby(['attrib2']).group.value_counts().sort_index()

# groupby gives multiple counts for the same combination, so set to 1 or leave as 0.
# The following line is not needed since I use value_counts below, which counts 1
# if there is something there regardless of the value, so 1, 2, etc. all get
# counted as 1 and 0 is 0.
# groupout3[groupout3 != 0] = 1

# convert back to a DataFrame for plotting
dfgroup = groupout3.to_frame('count')

# turn the index levels back into columns
dfgroup.reset_index(level=['group', 'attrib2'], inplace=True)

# plot the categorical counts
plt.figure()
dfgroup.attrib2.value_counts().plot(kind='bar')
plt.show()

Surely there is a more elegant way to do this? Thanks!

1 Answer:

Answer 0: (score: 1)

IIUC, you can do it like this:

(df.groupby(['attrib2','group'])
   .size()
   .reset_index()
   .groupby('attrib2')
   .size()
   .plot.bar(rot=0)
)
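As a possible shortcut, the two-step groupby above can probably be collapsed into a single nunique call, which counts the distinct group values per attrib2. This is a sketch on a handful of the sample rows, not something from the original answer; the name unique_groups is made up, and the plot call is left commented out so the snippet runs without matplotlib.

```python
import io
import pandas as pd

# A few rows from the sample data, with the column names shown in the answer.
raw = """attrib,num,attrib2,group
RS239,9.8,chew,B
RS240,7.8,toy,B
RS241,6.8,toy,B
RS242,5.8,toy,A
RS244,9.8,chew,A
RS245,8.8,chew,A
"""
df = pd.read_csv(io.StringIO(raw))

# Number of distinct groups seen with each attrib2 value;
# duplicate (attrib2, group) pairs are counted once.
unique_groups = df.groupby("attrib2")["group"].nunique()

# Uncomment to draw the same kind of bar chart as above (needs matplotlib).
# unique_groups.plot.bar(rot=0)
```

On these six rows both chew and toy appear with groups A and B, so each should come out as 2 even though (toy, B) and (chew, A) are repeated.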


Data:

In [85]: df
Out[85]:
   attrib  num  attrib2 group
0   RS232  1.8  focused     C
1   RS233  2.8     chew     E
2   RS234  3.8      toy     D
3   RS235  4.8   poodle     C
4   RS236  5.8  winding     E
5   RS237  6.8       up     D
6   RS238  7.8  focused     B
7   RS239  9.8     chew     B
8   RS240  7.8      toy     B
9   RS241  6.8      toy     B
10  RS242  5.8      toy     A
11  RS243  4.8  focused     A
12  RS244  9.8     chew     A
13  RS245  8.8     chew     A
14  RS246  7.8     chew     C
15  RS247  6.8  winding     C
16  RS248  5.8  winding     C
17  RS249  4.8  winding     D
18  RS250  3.8      toy     D
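For anyone who wants to reproduce this without out.txt, the following sketch loads the same 19 rows from a string and compares the chained groupby from the answer with a drop_duplicates variant that stays closer to the value_counts approach in the question. Because the posted input has no header row, the column names are passed explicitly; by_dedup and by_chain are illustrative names only.

```python
import io
import pandas as pd

raw = """RS232,1.8,focused,C
RS233,2.8,chew,E
RS234,3.8,toy,D
RS235,4.8,poodle,C
RS236,5.8,winding,E
RS237,6.8,up,D
RS238,7.8,focused,B
RS239,9.8,chew,B
RS240,7.8,toy,B
RS241,6.8,toy,B
RS242,5.8,toy,A
RS243,4.8,focused,A
RS244,9.8,chew,A
RS245,8.8,chew,A
RS246,7.8,chew,C
RS247,6.8,winding,C
RS248,5.8,winding,C
RS249,4.8,winding,D
RS250,3.8,toy,D
"""
# The posted input has no header row, so the names are supplied here.
df = pd.read_csv(io.StringIO(raw), names=["attrib", "num", "attrib2", "group"])

# Keep one row per (attrib2, group) pair, then count rows per attrib2.
deduped = df.drop_duplicates(subset=["attrib2", "group"])
by_dedup = deduped["attrib2"].value_counts().sort_index()

# The chained groupby/size from the answer, without the plot call.
by_chain = (df.groupby(["attrib2", "group"])
              .size()
              .reset_index()
              .groupby("attrib2")
              .size())
```

Both Series should agree on this data (chew 4, focused 3, poodle 1, toy 3, up 1, winding 3), and either one can be passed to .plot.bar(rot=0).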