从Pandas中的分组数据中获取最大的n个项目

时间:2018-12-19 17:32:44

标签: python pandas group-by

使用此数据

import pandas as pd 
df=pd.read_excel(
    "https://github.com/chris1610/pbpython/blob/master/data/sample-salesv3.xlsx?raw=True"
    )
df["date"] = pd.to_datetime(df['date']) df.head()

我使用此代码按月将ext pricename分组了

df.groupby([pd.Grouper(key='date', freq='M'), 'name'])['ext price'].sum()

出:

date        name                           
2014-01-31  Barton LLC                          6177.57
            Cronin, Oberbrunner and Spencer     1141.75
            Frami, Hills and Schmidt            5112.34
            Fritsch, Russel and Anderson       15130.77
            Halvorson, Crona and Champlin       9997.17
            Herman LLC                         10749.84
            Jerde-Hilpert                      11274.33
            Kassulke, Ondricka and Metz         7322.83
            Keeling LLC                         6847.86
            Kiehn-Spinka                        8097.50
            Koepp Ltd                          10768.33
            Kuhn-Gusikowski                     7309.54
            Kulas Inc                          15398.87
            Pollich LLC                         1004.22
            Purdy-Kunde                         4689.37
            Sanford and Sons                    9544.13
            Stokes LLC                          5809.34
            Trantow-Barrows                    14328.26
            White-Trantow                      13703.77
            Will LLC                           20953.87
2014-02-28  Barton LLC                         12218.03
            Cronin, Oberbrunner and Spencer    13976.26
            Frami, Hills and Schmidt            4124.53
            Fritsch, Russel and Anderson        9595.35
            Halvorson, Crona and Champlin       7082.15
            Herman LLC                          5831.40
            Jerde-Hilpert                       4088.40
            Kassulke, Ondricka and Metz         3061.12
            Keeling LLC                         3383.45
            Kiehn-Spinka                        3461.12
                                                 ...   
2014-11-30  Koepp Ltd                           4882.27
            Kuhn-Gusikowski                     7197.89
            Kulas Inc                           4149.34
            Pollich LLC                         6334.21
            Purdy-Kunde                         2376.00
            Sanford and Sons                    6834.04
            Stokes LLC                          6158.81
            Trantow-Barrows                     6550.10
            White-Trantow                       9544.61
            Will LLC                            3210.44
2014-12-31  Barton LLC                          2772.90
            Cronin, Oberbrunner and Spencer     7640.60
            Frami, Hills and Schmidt           16249.81
            Fritsch, Russel and Anderson       12345.64
            Halvorson, Crona and Champlin       2900.51
            Herman LLC                          4664.54
            Jerde-Hilpert                       6941.99
            Kassulke, Ondricka and Metz         4425.22
            Keeling LLC                        13247.88
            Kiehn-Spinka                       17401.28
            Koepp Ltd                          11791.00
            Kuhn-Gusikowski                     4959.85
            Kulas Inc                           6106.38
            Pollich LLC                        12357.76
            Purdy-Kunde                         4051.79
            Sanford and Sons                    2151.48
            Stokes LLC                          6366.26
            Trantow-Barrows                    10124.23
            White-Trantow                       4806.93
            Will LLC                           12561.21
Name: ext price, Length: 240, dtype: float64

现在,我尝试获取每个name

的前5名ext price(前month

我尝试过nlargest(5),但是它不起作用

函数head(5)也不能解决问题

4 个答案:

答案 0 :(得分:1)

也许有一种更快的方法,但是您可以按价格排序,按日期分组,获取前5名,然后对索引(也就是日期)进行排序。

s = df.groupby([pd.Grouper(key='date', freq='M'), 'name'])['ext price'].sum()
out = (s.reset_index(level=-1)
        .sort_values('ext price', ascending=False)
        .groupby('date')
        .head(5)
        .sort_index()
    )

out
# returns:
                                       name  ext price
date
2014-01-31                         Will LLC   20953.87
2014-01-31                    White-Trantow   13703.77
2014-01-31                  Trantow-Barrows   14328.26
2014-01-31     Fritsch, Russel and Anderson   15130.77
2014-01-31                        Kulas Inc   15398.87
2014-02-28                         Will LLC   13613.06
2014-02-28                        Kulas Inc   13973.65
2014-02-28  Cronin, Oberbrunner and Spencer   13976.26
2014-02-28                    White-Trantow   11783.98
2014-02-28                       Barton LLC   12218.03
...
2014-10-31                    White-Trantow   19896.85
2014-10-31                        Kulas Inc   15754.28
2014-10-31                  Kuhn-Gusikowski   10559.77
2014-10-31                    Jerde-Hilpert   11401.27
2014-10-31     Fritsch, Russel and Anderson   10686.18
2014-11-30                      Keeling LLC   10310.96
2014-11-30                    White-Trantow    9544.61
2014-11-30      Kassulke, Ondricka and Metz    8836.82
2014-11-30                  Kuhn-Gusikowski    7197.89
2014-11-30         Frami, Hills and Schmidt    7221.02
2014-12-31         Frami, Hills and Schmidt   16249.81
2014-12-31                      Pollich LLC   12357.76
2014-12-31                         Will LLC   12561.21
2014-12-31                     Kiehn-Spinka   17401.28
2014-12-31                      Keeling LLC   13247.88

答案 1 :(得分:1)

另一种选择是nlargest,但可能不会比James的建议快,因为排序和获取headtail应该比nlargest快:

new = df.groupby([pd.Grouper(key='date', freq='M'), 'name'])['ext price'].sum()
new.groupby(level=0).nlargest(5).sort_index().reset_index(level=1, drop=True).to_frame()

答案 2 :(得分:0)

您可以为此使用idxmax函数:

df[df.groupby('month')['ext price'].idxmax()]

答案 3 :(得分:0)

我相信这就是您想要的:

df = df.groupby([pd.Grouper(key="date", freq="M"), "name"])["ext price"].sum()
out = df.groupby(level=0).nlargest(5)
out.index = out.index.droplevel()

然后您将获得以下数据:

In [134]: out.head(10)
Out[134]: 
date        name                           
2014-01-31  Will LLC                           20953.87
            Kulas Inc                          15398.87
            Fritsch, Russel and Anderson       15130.77
            Trantow-Barrows                    14328.26
            White-Trantow                      13703.77
2014-02-28  Cronin, Oberbrunner and Spencer    13976.26
            Kulas Inc                          13973.65
            Will LLC                           13613.06
            Barton LLC                         12218.03
            White-Trantow                      11783.98
dtype: float64