Hierarchical group sizes in Pandas

Time: 2014-05-01 03:01:36

Tags: python pandas

Suppose I have a multi-index DataFrame in Pandas with several index levels, like this:

                     A         B         C
X      Y     Z                          
bar   one    a   -0.007381 -0.365315 -0.024817
             b   -1.219794  0.370955 -0.795125
baz   three  a    0.145578  1.428502 -0.408384
             b   -0.249321 -0.292967 -1.849202
      two    a   -0.249321 -0.292967 -1.849202
      four   a    0.21     -0.967123  1.202234
foo   one    b   -1.046479 -1.250595  0.781722
             a    1.314373  0.333150  0.133331
qux   one    c    0.716789  0.616471 -0.298493
      two    b    0.385795 -0.915417 -1.367644
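
For anyone who wants to reproduce this frame, here is a minimal sketch. The index tuples and values are copied from the table above; building it with `pd.MultiIndex.from_tuples` is just one possible way and is an assumption about how the frame was created.

```python
import pandas as pd

# Index tuples in the same order as the rows of the table above
index = pd.MultiIndex.from_tuples(
    [
        ("bar", "one", "a"), ("bar", "one", "b"),
        ("baz", "three", "a"), ("baz", "three", "b"),
        ("baz", "two", "a"), ("baz", "four", "a"),
        ("foo", "one", "b"), ("foo", "one", "a"),
        ("qux", "one", "c"), ("qux", "two", "b"),
    ],
    names=["X", "Y", "Z"],
)

# Values copied row by row from the table
values = [
    [-0.007381, -0.365315, -0.024817],
    [-1.219794, 0.370955, -0.795125],
    [0.145578, 1.428502, -0.408384],
    [-0.249321, -0.292967, -1.849202],
    [-0.249321, -0.292967, -1.849202],
    [0.21, -0.967123, 1.202234],
    [-1.046479, -1.250595, 0.781722],
    [1.314373, 0.333150, 0.133331],
    [0.716789, 0.616471, -0.298493],
    [0.385795, -0.915417, -1.367644],
]

df = pd.DataFrame(values, index=index, columns=["A", "B", "C"])
print(df)
```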

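I would like to know:

  1. The leaf size of every value at every level, i.e. the number of rows under it. In the example above, this would be: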

    bar: 2
    bar & one: 2
    bar & one & a: 1
    bar & one & b: 1
    baz: 4
    baz & three: 2
    baz & three & a: 1
    baz & three & b: 1 
    etc.
    
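  2. The relative size between consecutive levels, i.e. how many distinct values of the next level each group contains. In the example above, this would be: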

    # First level -> Second level :
    bar: 1 (i.e. grouping ["one"])
    baz: 3 (i.e. grouping ["three", "two", "four"])
    foo: 1 (i.e. grouping ["one"])
    qux: 2 (i.e. grouping ["one", "two"])
    
    # Second level -> Third level
    ... 
    
    # Third level -> Fourth level (if we had one)
    etc.
    
  3. Is there a way to do this in Pandas, and (preferably) to get the result in a DataFrame as well?

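2 answers:

Answer 0: (score: 2)

OK, since you added another part, I will flesh out my answer. For part 1, I would use a list comprehension to loop over the different groupby levels and get the size of every group. Then concat combines the results of each groupby:

print(pd.concat([df.groupby(level=x).size() for x in [0, [0, 1], [0, 1, 2]]]))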

bar                2
baz                4
foo                2
qux                2
(bar, one)         2
(baz, four)        1
(baz, three)       2
(baz, two)         1
(foo, one)         2
(qux, one)         1
(qux, two)         1
(bar, one, a)      1
(bar, one, b)      1
(baz, four, a)     1
(baz, three, a)    1
(baz, three, b)    1
(baz, two, a)      1
(foo, one, a)      1
(foo, one, b)      1
(qux, one, c)      1
(qux, two, b)      1

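Part 2 is more involved, but I think we can use the same structure. There are probably many ways to do it, but I would use the ngroups attribute inside the same basic list comprehension:

def group_count(df, x):
    # x is a pair (outer levels, inner levels); group one column by the outer levels
    by = df['A'].groupby(level=x[0])
    # within each outer group, count the distinct groups formed by the inner levels
    return by.agg(lambda g: g.groupby(level=x[1]).ngroups)

lvl = [0, [0, 1], [0, 1, 2]]
print(pd.concat([group_count(df, x) for x in zip(lvl[:-1], lvl[1:])]))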

bar             1
baz             3
foo             1
qux             2
(bar, one)      2
(baz, four)     1
(baz, three)    2
(baz, two)      1
(foo, one)      2
(qux, one)      1
(qux, two)      1
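
As a cross-check on the numbers above, here is a sketch of a different technique (not the ngroups approach): turn the index into ordinary columns with `MultiIndex.to_frame` and count distinct child values with `nunique`. The frame is rebuilt with placeholder values in column A, which is an assumption made only to keep the snippet self-contained.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [
        ("bar", "one", "a"), ("bar", "one", "b"),
        ("baz", "three", "a"), ("baz", "three", "b"),
        ("baz", "two", "a"), ("baz", "four", "a"),
        ("foo", "one", "b"), ("foo", "one", "a"),
        ("qux", "one", "c"), ("qux", "two", "b"),
    ],
    names=["X", "Y", "Z"],
)
# Placeholder values; only the index matters for counting
df = pd.DataFrame({"A": range(10)}, index=index)

# Index levels as plain columns X, Y, Z
idx = df.index.to_frame(index=False)

# First level -> second level: distinct Y values under each X
first_to_second = idx.groupby("X")["Y"].nunique()

# Second level -> third level: distinct Z values under each (X, Y)
second_to_third = idx.groupby(["X", "Y"])["Z"].nunique()

print(first_to_second)
print(second_to_third)
```

Both results are Series; calling `.reset_index()` on either one gives a DataFrame if that is the preferred form.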

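Of course, you may not like having tuples as the index. If you prefer, you can reset the index inside the list comprehension to get the following (shown here for part 1):

lvl = [0, [0, 1], [0, 1, 2]]
print(pd.concat([df.groupby(level=x).size().reset_index() for x in lvl]))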

   0    X      Y    Z
0  2  bar    NaN  NaN
1  4  baz    NaN  NaN
2  2  foo    NaN  NaN
3  2  qux    NaN  NaN
0  2  bar    one  NaN
1  1  baz   four  NaN
2  2  baz  three  NaN
3  1  baz    two  NaN
4  2  foo    one  NaN
5  1  qux    one  NaN
6  1  qux    two  NaN
0  1  bar    one    a
1  1  bar    one    b
2  1  baz   four    a
3  1  baz  three    a
4  1  baz  three    b
5  1  baz    two    a
6  1  foo    one    a
7  1  foo    one    b
8  1  qux    one    c
9  1  qux    two    b
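
If a cleaner frame is wanted, the same idea can be sketched with a named count column and a fresh row index. The column name `size`, the loop over level names, and the placeholder values in column A are illustrative assumptions, not part of the code above.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [
        ("bar", "one", "a"), ("bar", "one", "b"),
        ("baz", "three", "a"), ("baz", "three", "b"),
        ("baz", "two", "a"), ("baz", "four", "a"),
        ("foo", "one", "b"), ("foo", "one", "a"),
        ("qux", "one", "c"), ("qux", "two", "b"),
    ],
    names=["X", "Y", "Z"],
)
# Placeholder values; only the index matters for counting
df = pd.DataFrame({"A": range(10)}, index=index)

names = list(df.index.names)
frames = []
for depth in range(1, len(names) + 1):
    # Group by the first `depth` levels and count the rows in each group
    sizes = df.groupby(level=names[:depth]).size().rename("size").reset_index()
    frames.append(sizes)

# Levels that are not part of a grouping show up as NaN
result = pd.concat(frames, ignore_index=True)[names + ["size"]]
print(result)
```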

Answer 1: (score: 1)

There may be a more direct way, but this can be done by working with the index values:

In [50]:

df.index.tolist()
Out[50]:
[('bar', 'one', 'a'),
 ('bar', 'one', 'b'),
 ('baz', 'three', 'a'),
 ('baz', 'three', 'b'),
 ('baz', 'two', 'a'),
 ('baz', 'four', 'a'),
 ('foo', 'one', 'b'),
 ('foo', 'one', 'a'),
 ('qux', 'one', 'c'),
 ('qux', 'two', 'b')]
In [53]:

len([item for item in df.index.tolist() if item[0]=='bar'])
Out[53]:
2
In [54]:

len([item for item in df.index.tolist() if item[0]=='bar' and item[1]=='one'])
Out[54]:
2

Or vectorized:

In [71]:

A=np.asanyarray(df.index.tolist())
In [72]:

(A[:,:2]==np.array(['bar', 'one'])).all(1).sum()
Out[72]:
2
In [73]:

(A[:,:3]==np.array(['baz','three','b'])).all(1).sum()
Out[73]:
1
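
The same count can also be sketched without converting the index to an array of tuples, by comparing each level with `get_level_values`. The helper name `count_prefix` is hypothetical, and the frame is rebuilt with placeholder values in column A only to keep the snippet self-contained.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [
        ("bar", "one", "a"), ("bar", "one", "b"),
        ("baz", "three", "a"), ("baz", "three", "b"),
        ("baz", "two", "a"), ("baz", "four", "a"),
        ("foo", "one", "b"), ("foo", "one", "a"),
        ("qux", "one", "c"), ("qux", "two", "b"),
    ],
    names=["X", "Y", "Z"],
)
# Placeholder values; only the index matters for counting
df = pd.DataFrame({"A": range(10)}, index=index)


def count_prefix(index, prefix):
    """Count rows whose leading index levels equal the values in `prefix`."""
    mask = np.ones(len(index), dtype=bool)
    for level, value in enumerate(prefix):
        # Boolean comparison of one whole level against one value
        mask &= np.asarray(index.get_level_values(level) == value)
    return int(mask.sum())


print(count_prefix(df.index, ("bar",)))
print(count_prefix(df.index, ("bar", "one")))
print(count_prefix(df.index, ("baz", "three", "b")))
```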