Pandas - 从行计数

时间:2016-11-11 21:37:11

标签: python pandas numpy dataframe

鉴于此DataFrame:

    bowl    cookie
0   one     chocolate
1   two     chocolate
2   two     chocolate
3   two     vanilla
4   one     vanilla
5   one     vanilla
6   one     vanilla
7   one     vanilla
8   one     vanilla
9   two     chocolate

我想获得以下汇总的DataFrame:

        vanilla     chocolate
one     5           1
two     1           3

除了手动进行:

vanilla_bowl1 = len(df_picks[(df_picks['bowl'] == 'one') & (df_picks['cookie'] == 'vanilla')])
vanilla_bowl2 = len(df_picks[(df_picks['bowl'] == 'two') & (df_picks['cookie'] == 'vanilla')])
chocolate_bowl1 = ...
chocolate_bowl2 = ...

有没有办法在Pandas的单一操作中执行此操作?

注意:我查看了df.pivot(),如果我在每行中添加等于count的列1,这将会有效:

    bowl    cookie      count
0   one     chocolate       1
1   two     chocolate       1
2   two     chocolate       1
3   two     vanilla         1
4   one     vanilla         1
5   one     vanilla         1
6   one     vanilla         1
7   one     vanilla         1
8   one     vanilla         1
9   two     chocolate       1

然后

df.pivot(index='bowl', columns='cookie', values='count')

但是,我想知道是否有更直接的方法,首先不需要添加count列。

3 个答案:

答案 0 :(得分:3)

最简洁的方法可能是pandas.crosstab函数:

>>> pandas.crosstab(d.bowl, d.cookie)
cookie  chocolate  vanilla
bowl                      
one             1        5
two             3        1

答案 1 :(得分:2)

您可以使用pivot_table()方法:

In [33]: df.pivot_table(index='bowl', columns='cookie', aggfunc='size', fill_value=0)
Out[33]:
cookie  chocolate  vanilla
bowl
one             1        5
two             3        1

或者您可以使用groupby()size()unstack() - pivot_table()如何做到这一点:

In [36]: df.groupby(['bowl', 'cookie']).size().unstack('cookie', fill_value=0)
Out[36]:
cookie  chocolate  vanilla
bowl
one             1        5
two             3        1

100K行的时间DF:

In [48]: big = pd.concat([df] * 10**4, ignore_index=True)

In [49]: big.shape
Out[49]: (100000, 2)

In [50]: %timeit pd.crosstab(big.bowl, big.cookie)
10 loops, best of 3: 58 ms per loop

In [51]: %timeit big.pivot_table(index='bowl', columns='cookie', aggfunc='size', fill_value=0)
10 loops, best of 3: 38.4 ms per loop

In [52]: %timeit big.groupby(['bowl', 'cookie']).size().unstack('cookie', fill_value=0)
10 loops, best of 3: 34.2 ms per loop

In [118]: %timeit pir(big)
1 loop, best of 3: 631 ms per loop

In [119]: big.shape
Out[119]: (100000, 2)

1M行的时间DF:

In [53]: big = pd.concat([big] * 10, ignore_index=True)

In [54]: big.shape
Out[54]: (1000000, 2)

In [55]: %timeit pd.crosstab(big.bowl, big.cookie)
1 loop, best of 3: 446 ms per loop

In [56]: %timeit big.pivot_table(index='bowl', columns='cookie', aggfunc='size', fill_value=0)
1 loop, best of 3: 333 ms per loop

In [57]: %timeit big.groupby(['bowl', 'cookie']).size().unstack('cookie', fill_value=0)
1 loop, best of 3: 327 ms per loop

In [121]: %timeit pir(big)
1 loop, best of 3: 7.08 s per loop

In [122]: big.shape
Out[122]: (1000000, 2)

答案 2 :(得分:1)

一种笨拙的方法

from itertools import product
import pandas as pd
import numpy as np

def pir(df):
    ub = pd.Index(np.unique(df.values[:, 0]), name='bowl')
    uc = pd.Index(np.unique(df.values[:, 1]), name='cookie')
    u = np.array(list(product(ub.values, uc.values)))
    e = u[:, None] == df.values

    return pd.DataFrame(
        e.all(2).sum(1).reshape(-1, 2),
        ub, uc
    )

pir(df)

enter image description here