Pandas - aggregate, sort and nlargest inside groupby

Posted: 2016-12-26 16:32:36

Tags: python pandas

I have following dataframe:

                       some_id
2016-12-26 11:03:10        001
2016-12-26 11:03:13        001
2016-12-26 12:03:13        001
2016-12-26 12:03:13        008
2016-12-27 11:03:10        009
2016-12-27 11:03:13        009
2016-12-27 12:03:13        003
2016-12-27 12:03:13        011

And I need to do something like transform('size'), followed by a sort, and then take the N largest values per day, to get something like this (N=2):

             some_id   size
2016-12-26       001      3
                 008      1
2016-12-27       009      2
                 003      1

Is there an elegant way to do that in pandas 0.19.x?

4 Answers:

Answer 0 (score: 4):

After grouping on the date part of your DatetimeIndex, use value_counts to compute the distinct counts. By default, it sorts in descending order.

You then only need to take the first two rows of this result within each group to get the largest (top-2) counts.

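The code and output of this answer were posted as an image; a minimal sketch of the described approach, assuming the frame from the question (a DatetimeIndex and a some_id column), could look like this:

# Group on the date part of the index and count each some_id; value_counts
# sorts descending within each group, so head(2) keeps the top two per day.
df.groupby(df.index.date)['some_id'].value_counts() \
    .groupby(level=0).head(2)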

Answer 1 (score: 2):

Setup

from io import StringIO
import pandas as pd

txt = """                 some_id
2016-12-26 11:03:10        001
2016-12-26 11:03:13        001
2016-12-26 12:03:13        001
2016-12-26 12:03:13        008
2016-12-27 11:03:10        009
2016-12-27 11:03:13        009
2016-12-27 12:03:13        003
2016-12-27 12:03:13        011"""

df = pd.read_csv(StringIO(txt), sep=r'\s{2,}', engine='python')

df.index = pd.to_datetime(df.index)
df.some_id = df.some_id.astype(str).str.zfill(3)

df

                    some_id
2016-12-26 11:03:10     001
2016-12-26 11:03:13     001
2016-12-26 12:03:13     001
2016-12-26 12:03:13     008
2016-12-27 11:03:10     009
2016-12-27 11:03:13     009
2016-12-27 12:03:13     003
2016-12-27 12:03:13     011

Using nlargest

# pd.TimeGrouper('D') buckets the DatetimeIndex by calendar day (pandas 0.19.x API)
df.groupby(pd.TimeGrouper('D')).some_id.value_counts() \
    .groupby(level=0, group_keys=False).nlargest(2)

            some_id
2016-12-26  001        3
            008        1
2016-12-27  009        2
            003        1
Name: some_id, dtype: int64
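
Note: pd.TimeGrouper was deprecated in later pandas releases; on newer versions the same daily grouping can be expressed with pd.Grouper, for example:

df.groupby(pd.Grouper(freq='D')).some_id.value_counts() \
    .groupby(level=0, group_keys=False).nlargest(2)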

Answer 2 (score: 2):

You should be able to do this in one line:

# resample('D') groups by day; value_counts sorts descending, so iloc[:2] keeps the top two
df.resample('D')['some_id'].apply(lambda s: s.value_counts().iloc[:2])

Answer 3 (score: 0):

If you already have a size column, you can use the following:

df.groupby('some_id')['size'].value_counts().groupby(level=0).nlargest(2)

Otherwise, you can use this approach:

import pandas as pd

df = pd.DataFrame({'some_id':[1,1,1,8,9,9,3,11],
                   'some_idx':[26,26,26,26,27,27,27,27]})

# count occurrences of each (some_id, some_idx) pair
sizes = df.groupby(['some_id', 'some_idx']).size()

# keep the two largest counts within each some_idx group
sizes.groupby(level='some_idx').nlargest(2)

# some_idx  some_id  some_idx
# 26        1        26          3
#           8        26          1
# 27        9        27          2
#           3        27          1