Suppose I have a multi-index DataFrame in Pandas with several index levels, like this:
                    A         B         C
X   Y     Z
bar one   a -0.007381 -0.365315 -0.024817
          b -1.219794  0.370955 -0.795125
baz three a  0.145578  1.428502 -0.408384
          b -0.249321 -0.292967 -1.849202
    two   a -0.249321 -0.292967 -1.849202
    four  a  0.21     -0.967123  1.202234
foo one   b -1.046479 -1.250595  0.781722
          a  1.314373  0.333150  0.133331
qux one   c  0.716789  0.616471 -0.298493
    two   b  0.385795 -0.915417 -1.367644
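For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the frame above. It assumes X, Y and Z are the names of the index levels and A, B and C are the columns, which is how I read the printed header; the variable names `rows`, `values` and `df` are only for illustration.

```python
import pandas as pd

# Index tuples copied from the table above, in the same row order.
rows = [
    ("bar", "one", "a"), ("bar", "one", "b"),
    ("baz", "three", "a"), ("baz", "three", "b"),
    ("baz", "two", "a"), ("baz", "four", "a"),
    ("foo", "one", "b"), ("foo", "one", "a"),
    ("qux", "one", "c"), ("qux", "two", "b"),
]
# Assumption: X, Y, Z are the index level names.
index = pd.MultiIndex.from_tuples(rows, names=["X", "Y", "Z"])

# Cell values copied from the table, one inner list per row.
values = [
    [-0.007381, -0.365315, -0.024817],
    [-1.219794, 0.370955, -0.795125],
    [0.145578, 1.428502, -0.408384],
    [-0.249321, -0.292967, -1.849202],
    [-0.249321, -0.292967, -1.849202],
    [0.21, -0.967123, 1.202234],
    [-1.046479, -1.250595, 0.781722],
    [1.314373, 0.333150, 0.133331],
    [0.716789, 0.616471, -0.298493],
    [0.385795, -0.915417, -1.367644],
]
df = pd.DataFrame(values, index=index, columns=["A", "B", "C"])
print(df)
```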
I would like to know:
The leaf size for each value at each level. In the example above, this would be:
bar: 2
bar & one: 2
bar & one & a: 1
bar & one & b: 1
baz: 4
baz & three: 2
baz & three & a: 1
baz & three & b: 1
etc.
The relative size between consecutive levels. In the example above, this would be:
# First level -> Second level:
bar: 1 (i.e. grouping ["one"])
baz: 3 (i.e. grouping ["three", "two", "four"])
foo: 1 (i.e. grouping ["one"])
qux: 2 (i.e. grouping ["one", "two"])
# Second level -> Third level
...
# Third level -> Fourth level (if we had one)
etc.
Is there a way to do this in Pandas and (preferably) get the result back in a DataFrame as well?
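Answer 0 (score: 2)
OK, since you added another part, I'll flesh out my answer. For part 1, I would use a list comprehension to loop over the different groupby levels and get the size of every group, then use concat to combine the results of each groupby: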
print(pd.concat([df.groupby(level=x).size() for x in [0,[0,1],[0,1,2]]]))
bar 2
baz 4
foo 2
qux 2
(bar, one) 2
(baz, four) 1
(baz, three) 2
(baz, two) 1
(foo, one) 2
(qux, one) 1
(qux, two) 1
(bar, one, a) 1
(bar, one, b) 1
(baz, four, a) 1
(baz, three, a) 1
(baz, three, b) 1
(baz, two, a) 1
(foo, one, a) 1
(foo, one, b) 1
(qux, one, c) 1
(qux, two, b) 1
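The level list above is hard-coded for three levels. As a rough sketch of the same idea for any number of levels, the prefixes can be derived from the index names instead. This assumes the index levels are named X, Y and Z as in the question; the single placeholder column and the names `rows` and `sizes` are mine, not part of the original answer.

```python
import pandas as pd

rows = [
    ("bar", "one", "a"), ("bar", "one", "b"),
    ("baz", "three", "a"), ("baz", "three", "b"),
    ("baz", "two", "a"), ("baz", "four", "a"),
    ("foo", "one", "b"), ("foo", "one", "a"),
    ("qux", "one", "c"), ("qux", "two", "b"),
]
index = pd.MultiIndex.from_tuples(rows, names=["X", "Y", "Z"])
# Placeholder data: only the index matters for counting rows.
df = pd.DataFrame({"A": range(len(rows))}, index=index)

names = list(df.index.names)
# One groupby per prefix of the level names: ["X"], ["X", "Y"], ["X", "Y", "Z"].
sizes = {
    depth: df.groupby(level=names[:depth]).size()
    for depth in range(1, len(names) + 1)
}

for depth, counts in sizes.items():
    print("depth", depth)
    print(counts)
```

Keeping one Series per depth avoids mixing plain labels and tuples in a single index; the pieces can still be passed to concat afterwards if one combined object is wanted.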
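Part 2 is more involved, but I think we can use the same structure. There are probably many ways to do it, but I would use the ngroups attribute of a groupby inside the same basic list comprehension: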
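def group_count(df, x):
    # x is a pair (outer levels, inner levels), e.g. (0, [0, 1])
    by = df['A'].groupby(level=x[0])
    # within each outer group, count the distinct groups at the inner levels
    return by.agg(lambda g: g.groupby(level=x[1]).ngroups)

lvl = [0,[0,1],[0,1,2]]
print(pd.concat([group_count(df,x) for x in zip(lvl[:-1],lvl[1:])]))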
bar 1
baz 3
foo 1
qux 2
(bar, one) 2
(baz, four) 1
(baz, three) 2
(baz, two) 1
(foo, one) 2
(qux, one) 1
(qux, two) 1
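As an alternative sketch for part 2 (a different technique from the nested groupby above, not a rewrite of it), the index levels can be turned into ordinary columns and the distinct child values counted with nunique. The level names X, Y and Z are taken from the question; `levels`, `x_to_y` and `xy_to_z` are names I made up for illustration.

```python
import pandas as pd

rows = [
    ("bar", "one", "a"), ("bar", "one", "b"),
    ("baz", "three", "a"), ("baz", "three", "b"),
    ("baz", "two", "a"), ("baz", "four", "a"),
    ("foo", "one", "b"), ("foo", "one", "a"),
    ("qux", "one", "c"), ("qux", "two", "b"),
]
index = pd.MultiIndex.from_tuples(rows, names=["X", "Y", "Z"])

# Turn the index levels into regular columns so they can be grouped and counted.
levels = index.to_frame(index=False)

# First level -> second level: distinct Y values under each X.
x_to_y = levels.groupby("X")["Y"].nunique()

# Second level -> third level: distinct Z values under each (X, Y) pair.
xy_to_z = levels.groupby(["X", "Y"])["Z"].nunique()

print(x_to_y)
print(xy_to_z)
```

With the real frame, `df.index.to_frame(index=False)` would take the place of `index.to_frame(index=False)`.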
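Of course, you may not like having tuples as the index; if so, you can reset the index inside the list comprehension to get the following (shown here for part 1):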
lvl = [0,[0,1],[0,1,2]]
print(pd.concat([df.groupby(level=x).size().reset_index() for x in lvl]))
   0    X      Y    Z
0  2  bar    NaN  NaN
1  4  baz    NaN  NaN
2  2  foo    NaN  NaN
3  2  qux    NaN  NaN
0  2  bar    one  NaN
1  1  baz   four  NaN
2  2  baz  three  NaN
3  1  baz    two  NaN
4  2  foo    one  NaN
5  1  qux    one  NaN
6  1  qux    two  NaN
0  1  bar    one    a
1  1  bar    one    b
2  1  baz   four    a
3  1  baz  three    a
4  1  baz  three    b
5  1  baz    two    a
6  1  foo    one    a
7  1  foo    one    b
8  1  qux    one    c
9  1  qux    two    b
Answer 1 (score: 1)
There may be a more direct way, but this can be done by working with the values of the index:
In [50]:
df.index.tolist()
Out[50]:
[('bar', 'one', 'a'),
('bar', 'one', 'b'),
('baz', 'three', 'a'),
('baz', 'three', 'b'),
('baz', 'two', 'a'),
('baz', 'four', 'a'),
('foo', 'one', 'b'),
('foo', 'one', 'a'),
('qux', 'one', 'c'),
('qux', 'two', 'b')]
In [53]:
len([item for item in df.index.tolist() if item[0]=='bar'])
Out[53]:
2
In [54]:
len([item for item in df.index.tolist() if item[0]=='bar' and item[1]=='one'])
Out[54]:
2
Or, vectorized:
In [71]:
import numpy as np
A=np.asanyarray(df.index.tolist())
In [72]:
(A[:,:2]==np.array(['bar', 'one'])).all(1).sum()
Out[72]:
2
In [73]:
(A[:,:3]==np.array(['baz','three','b'])).all(1).sum()
Out[73]:
1
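The same counts can probably also be had without converting the index to a NumPy array of tuples, by comparing each level directly. This is only a sketch under the assumption that the levels are named X, Y and Z as in the question; the `n_*` variable names are hypothetical.

```python
import pandas as pd

rows = [
    ("bar", "one", "a"), ("bar", "one", "b"),
    ("baz", "three", "a"), ("baz", "three", "b"),
    ("baz", "two", "a"), ("baz", "four", "a"),
    ("foo", "one", "b"), ("foo", "one", "a"),
    ("qux", "one", "c"), ("qux", "two", "b"),
]
index = pd.MultiIndex.from_tuples(rows, names=["X", "Y", "Z"])

# Pull each level out as a flat Index with one label per row.
x = index.get_level_values("X")
y = index.get_level_values("Y")
z = index.get_level_values("Z")

# Comparing a level to a label gives a boolean mask; summing it counts the rows.
n_bar = (x == "bar").sum()
n_bar_one = ((x == "bar") & (y == "one")).sum()
n_baz_three_b = ((x == "baz") & (y == "three") & (z == "b")).sum()

print(n_bar, n_bar_one, n_baz_three_b)  # 2 2 1
```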