Pandas: bin and count

Date: 2016-08-18 10:18:31

Tags: python pandas count histogram bin

I'm new to pandas, so please don't be too harsh ;) Let's assume my initial dataframe looks like this:

import numpy as np
import pandas as pd

#::: initialize dictionary
np.random.seed(0)
d = {}
d['size'] = 2 * np.random.randn(100) + 3
d['flag_A'] = np.random.randint(0,2,100).astype(bool)
d['flag_B'] = np.random.randint(0,2,100).astype(bool)
d['flag_C'] = np.random.randint(0,2,100).astype(bool)

#::: convert dictionary into pandas dataframe
df = pd.DataFrame(d)

I now bin the dataframe by 'size':

#::: bin pandas dataframe per size
bins = np.arange(0,10,1)
groups = df.groupby( pd.cut( df['size'], bins ) )

which results in this output:

---
(0, 1]
   flag_A flag_B flag_C      size
25  False  False   True  0.091269
40   True   True   True  0.902894
41   True   True   True  0.159964
46  False   True   True  0.494409
53  False   True   True  0.638736
73   True  False   True  0.530348
80   True  False  False  0.669700
88   True   True   True  0.858495
---
(1, 2]
   flag_A flag_B flag_C      size
...
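
A listing like the one above is not produced by the groupby call itself; one way to get it is to iterate over the groupby object. The following is only a sketch under that assumption: the loop variable names are illustrative, only one flag column is rebuilt to keep it short, and `observed=True` is added so that empty bins are skipped.

```python
import numpy as np
import pandas as pd

# Rebuild a reduced version of the example data (size plus one flag)
np.random.seed(0)
df = pd.DataFrame({
    'size': 2 * np.random.randn(100) + 3,
    'flag_A': np.random.randint(0, 2, 100).astype(bool),
})

bins = np.arange(0, 10, 1)
# observed=True keeps only bins that actually contain rows
groups = df.groupby(pd.cut(df['size'], bins), observed=True)

# Each iteration yields the bin (a pandas Interval) and the rows that fall in it
for interval, frame in groups:
    print('---')
    print(interval)
    print(frame)
```

Rows whose size falls outside the bins (for example negative values) get no bin and are dropped by the groupby.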

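My question now is: how can I get, from here, the counts of True and False for each flag (A, B, C) in every bin? For example, for bin = (0, 1] I would like to get something like N_flag_A_true = 5, N_flag_A_false = 3, and so on. Ideally I would like to summarize this information by extending this dataframe or in a new dataframe.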

2 answers:

Answer 0 (score: 3)

This can be done with multi-index groupbys, concatenating the results and unstacking:

flag_A = df.groupby( [pd.cut( df['size'], bins),'flag_A'] ).count()['size'].to_frame()
flag_B = df.groupby( [pd.cut( df['size'], bins),'flag_B'] ).count()['size'].to_frame()
flag_C = df.groupby( [pd.cut( df['size'], bins),'flag_C'] ).count()['size'].to_frame()

T = pd.concat([flag_A,flag_B],axis=1)
R = pd.concat([T,flag_C],axis=1)
R.columns = ['flag_A','flag_B','flag_C']
R.index.names = [u'Bins',u'Value']
R = R.unstack('Value')

The result is:

       flag_A       flag_B       flag_C      
Value   False True   False True   False True 
Bins                                         
(0, 1]    3.0   5.0    3.0   5.0    1.0   7.0
(1, 2]    6.0   8.0    7.0   7.0    5.0   9.0
(2, 3]    7.0   9.0   11.0   5.0   13.0   3.0
(3, 4]   15.0  12.0   12.0  15.0   17.0  10.0
(4, 5]    2.0   8.0    5.0   5.0    7.0   3.0
(5, 6]    5.0   5.0    3.0   7.0    7.0   3.0
(6, 7]    1.0   5.0    NaN   6.0    3.0   3.0
(7, 8]    NaN   2.0    1.0   1.0    NaN   2.0
(8, 9]    NaN   NaN    NaN   NaN    NaN   NaN
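
The three near-identical groupby lines can also be folded into a loop. This is a variation on the approach above rather than the answer's own code: it uses `size()` instead of `count()`, passes `observed=True`, and fills missing combinations with 0, so the result holds integers and omits empty bins instead of showing NaN. The names `binned`, `flags` and `counts` are illustrative.

```python
import numpy as np
import pandas as pd

# Same example data as in the question
np.random.seed(0)
d = {}
d['size'] = 2 * np.random.randn(100) + 3
d['flag_A'] = np.random.randint(0, 2, 100).astype(bool)
d['flag_B'] = np.random.randint(0, 2, 100).astype(bool)
d['flag_C'] = np.random.randint(0, 2, 100).astype(bool)
df = pd.DataFrame(d)

bins = np.arange(0, 10, 1)
binned = pd.cut(df['size'], bins)
flags = ['flag_A', 'flag_B', 'flag_C']

# One (bin x True/False) count table per flag, joined side by side;
# the dict keys become the outer column level
counts = pd.concat(
    {
        flag: df.groupby([binned, flag], observed=True).size().unstack(fill_value=0)
        for flag in flags
    },
    axis=1,
)
counts.index.name = 'Bins'
counts.columns.names = ['Flag', 'Value']
print(counts)
```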

Edit: you can flatten the multi-index into plain columns like this:

R.columns = ['flag_A_F','flag_A_T','flag_B_F','flag_B_T','flag_C_F','flag_C_T']

Result:

        flag_A_F  flag_A_T  flag_B_F  flag_B_T  flag_C_F  flag_C_T
Bins                                                              
(0, 1]       3.0       5.0       3.0       5.0       1.0       7.0
(1, 2]       6.0       8.0       7.0       7.0       5.0       9.0
(2, 3]       7.0       9.0      11.0       5.0      13.0       3.0
(3, 4]      15.0      12.0      12.0      15.0      17.0      10.0
(4, 5]       2.0       8.0       5.0       5.0       7.0       3.0
(5, 6]       5.0       5.0       3.0       7.0       7.0       3.0
(6, 7]       1.0       5.0       NaN       6.0       3.0       3.0
(7, 8]       NaN       2.0       1.0       1.0       NaN       2.0
(8, 9]       NaN       NaN       NaN       NaN       NaN       NaN
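
If there are many flags, typing the flat names by hand gets tedious. As a sketch, the names can be derived from the column tuples instead. The small frame below is only a stand-in for R (first two bins, flags A and B, values taken from the table above), and the `_T` / `_F` suffix scheme simply mirrors the names used in the edit.

```python
import pandas as pd

# Stand-in for R: two-level columns of (flag name, boolean value)
columns = pd.MultiIndex.from_product(
    [['flag_A', 'flag_B'], [False, True]], names=[None, 'Value']
)
R = pd.DataFrame(
    [[3, 5, 3, 5], [6, 8, 7, 7]],
    index=pd.Index(['(0, 1]', '(1, 2]'], name='Bins'),
    columns=columns,
)

# Iterating over a MultiIndex yields one (flag, value) tuple per column
R.columns = [
    '{}_{}'.format(flag, 'T' if value else 'F') for flag, value in R.columns
]
print(R)
```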

Answer 1 (score: 2)

You can add the bins to the DF as a column and then use pd.melt:

df['group'] = pd.cut(df['size'], bins=bins)
melted = pd.melt(df, id_vars='group', value_vars=['flag_A', 'flag_B', 'flag_C'])

Which gives you:

      group variable  value
0    (6, 7]   flag_A  False
1    (3, 4]   flag_A  False
2    (4, 5]   flag_A   True
3    (7, 8]   flag_A   True
4    (6, 7]   flag_A   True
5    (1, 2]   flag_A  False
[...]

Then group by these columns and get the size of each group:

df2 = melted.groupby(['group', 'variable', 'value']).size()

This gives you:

group   variable  value
(0, 1]  flag_A    False     3
                  True      5
        flag_B    False     3
                  True      5
        flag_C    False     1
                  True      7
(1, 2]  flag_A    False     6
                  True      8
        flag_B    False     7
                  True      7
        flag_C    False     5
                  True      9
(2, 3]  flag_A    False     7
                  True      9
        flag_B    False    11
                  True      5
        flag_C    False    13
                  True      3
        [...]

Then you need to reshape it depending on how you want to use it...
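
One possible reshape, sketched here as an assumption about what the asker wants (one row per bin, one column per flag/value pair), is to unstack the two inner index levels. The name `wide` is illustrative, and `observed=True` and `fill_value=0` are additions that drop empty bins and replace missing combinations with 0.

```python
import numpy as np
import pandas as pd

# Same example data as in the question
np.random.seed(0)
d = {}
d['size'] = 2 * np.random.randn(100) + 3
d['flag_A'] = np.random.randint(0, 2, 100).astype(bool)
d['flag_B'] = np.random.randint(0, 2, 100).astype(bool)
d['flag_C'] = np.random.randint(0, 2, 100).astype(bool)
df = pd.DataFrame(d)

bins = np.arange(0, 10, 1)
df['group'] = pd.cut(df['size'], bins=bins)
melted = pd.melt(df, id_vars='group', value_vars=['flag_A', 'flag_B', 'flag_C'])

# Long format: one count per (bin, flag, value) combination that occurs
df2 = melted.groupby(['group', 'variable', 'value'], observed=True).size()

# Wide format: bins as rows, (flag, value) pairs as columns
wide = df2.unstack(['variable', 'value'], fill_value=0).sort_index(axis=1)
print(wide)
```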