鉴于此DataFrame:
bowl cookie
0 one chocolate
1 two chocolate
2 two chocolate
3 two vanilla
4 one vanilla
5 one vanilla
6 one vanilla
7 one vanilla
8 one vanilla
9 two chocolate
我想获得以下汇总的DataFrame:
vanilla chocolate
one 5 1
two 1 3
除了手动进行:
vanilla_bowl1 = len(df_picks[(df_picks['bowl'] == 'one') & (df_picks['cookie'] == 'vanilla')])
vanilla_bowl2 = len(df_picks[(df_picks['bowl'] == 'two') & (df_picks['cookie'] == 'vanilla')])
chocolate_bowl1 = ...
chocolate_bowl2 = ...
有没有办法在Pandas
的单一操作中执行此操作?
注意:我查看了df.pivot()
,如果我在每行中添加等于count
的列1
,这将会有效:
bowl cookie count
0 one chocolate 1
1 two chocolate 1
2 two chocolate 1
3 two vanilla 1
4 one vanilla 1
5 one vanilla 1
6 one vanilla 1
7 one vanilla 1
8 one vanilla 1
9 two chocolate 1
然后
df.pivot(index='bowl', columns='cookie', values='count')
但是,我想知道是否有更直接的方法,首先不需要添加count
列。
答案 0 :(得分:3)
最简洁的方法可能是pandas.crosstab
函数:
>>> pandas.crosstab(d.bowl, d.cookie)
cookie chocolate vanilla
bowl
one 1 5
two 3 1
答案 1 :(得分:2)
您可以使用pivot_table()方法:
In [33]: df.pivot_table(index='bowl', columns='cookie', aggfunc='size', fill_value=0)
Out[33]:
cookie chocolate vanilla
bowl
one 1 5
two 3 1
或者您可以使用groupby()
,size()
和unstack()
- pivot_table()
如何做到这一点:
In [36]: df.groupby(['bowl', 'cookie']).size().unstack('cookie', fill_value=0)
Out[36]:
cookie chocolate vanilla
bowl
one 1 5
two 3 1
100K行的时间DF:
In [48]: big = pd.concat([df] * 10**4, ignore_index=True)
In [49]: big.shape
Out[49]: (100000, 2)
In [50]: %timeit pd.crosstab(big.bowl, big.cookie)
10 loops, best of 3: 58 ms per loop
In [51]: %timeit big.pivot_table(index='bowl', columns='cookie', aggfunc='size', fill_value=0)
10 loops, best of 3: 38.4 ms per loop
In [52]: %timeit big.groupby(['bowl', 'cookie']).size().unstack('cookie', fill_value=0)
10 loops, best of 3: 34.2 ms per loop
In [118]: %timeit pir(big)
1 loop, best of 3: 631 ms per loop
In [119]: big.shape
Out[119]: (100000, 2)
1M行的时间DF:
In [53]: big = pd.concat([big] * 10, ignore_index=True)
In [54]: big.shape
Out[54]: (1000000, 2)
In [55]: %timeit pd.crosstab(big.bowl, big.cookie)
1 loop, best of 3: 446 ms per loop
In [56]: %timeit big.pivot_table(index='bowl', columns='cookie', aggfunc='size', fill_value=0)
1 loop, best of 3: 333 ms per loop
In [57]: %timeit big.groupby(['bowl', 'cookie']).size().unstack('cookie', fill_value=0)
1 loop, best of 3: 327 ms per loop
In [121]: %timeit pir(big)
1 loop, best of 3: 7.08 s per loop
In [122]: big.shape
Out[122]: (1000000, 2)
答案 2 :(得分:1)
一种笨拙的方法
from itertools import product
import pandas as pd
import numpy as np
def pir(df):
ub = pd.Index(np.unique(df.values[:, 0]), name='bowl')
uc = pd.Index(np.unique(df.values[:, 1]), name='cookie')
u = np.array(list(product(ub.values, uc.values)))
e = u[:, None] == df.values
return pd.DataFrame(
e.all(2).sum(1).reshape(-1, 2),
ub, uc
)
pir(df)