How to do a two-level groupby that keeps the top n items per group

Date: 2019-04-10 07:53:07

Tags: python pandas dataframe pandas-groupby

I have this DataFrame:

          STATE              County            POP
1       Alabama      Autauga County          54571
2       Alabama      Baldwin County         182265
3       Alabama      Barbour County          27457
4       Alabama         Bibb County          22915
5       Alabama       Blount County          57322
6       Alabama      Bullock County          10914
7       Alabama       Butler County          20947
8       Alabama      Calhoun County         118572
...
3162  Wisconsin     Washburn County          15911
3163  Wisconsin   Washington County         131887
3164  Wisconsin     Waukesha County         389891
3165  Wisconsin      Waupaca County          52410
3166  Wisconsin     Waushara County          24496
3167  Wisconsin    Winnebago County         166994
3168  Wisconsin         Wood County          74749
....
3182    Wyoming      Natrona County          75450
3183    Wyoming     Niobrara County           2484
3184    Wyoming         Park County          28205
3185    Wyoming       Platte County           8667
3186    Wyoming     Sheridan County          29116

How can I group the data by state and county so that it shows the 3 most populous counties of each state?

    STATE              COUNTY            POP
  Alabama      Baldwin County         182265
               Calhoun County         118572
                Blount County          57322
Wisconsin     Waukesha County         389891
             Winnebago County         166994
            Washington County         131887
  Wyoming      Natrona County          75450
              Sheridan County          29116
                  Park County          28205

I tried:

df.sort_values('POP',ascending=False).groupby(['STATE','COUNTY']).sum().head(3)

But this only shows the first 3 rows of the whole result, whereas I want the top 3 entries of each group:

                         POP
  STATE         COUNTY
Alabama Autauga County  54571
        Baldwin County  182265
        Barbour County  27457

2 Answers:

Answer 0 (score: 3)

Use groupby twice: first aggregate by STATE and County, then group by the first index level and apply nlargest, followed by reset_index.

(df.groupby(['STATE', 'County'])['POP'].sum()
 .groupby(level=0, group_keys=False).nlargest(3).reset_index())

       STATE             County     POP
0    Alabama     Baldwin County  182265
1    Alabama     Calhoun County  118572
2    Alabama      Blount County   57322
3  Wisconsin    Waukesha County  389891
4  Wisconsin   Winnebago County  166994
5  Wisconsin  Washington County  131887
6    Wyoming     Natrona County   75450
7    Wyoming    Sheridan County   29116
8    Wyoming        Park County   28205

Or, if you prefer, skip reset_index and the output will be:

STATE      County           
Alabama    Baldwin County       182265
           Calhoun County       118572
           Blount County         57322
Wisconsin  Waukesha County      389891
           Winnebago County     166994
           Washington County    131887
Wyoming    Natrona County        75450
           Sheridan County       29116
           Park County           28205
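
As a self-contained illustration of the two-step groupby above, here is a minimal sketch. The sample frame is an assumption: it is built from a handful of the rows shown in the question, not the full dataset, and the names per_county and top3 are only illustrative.

```python
import pandas as pd

# Assumed sample: a few rows copied from the question, not the full dataset.
df = pd.DataFrame({
    "STATE": ["Alabama"] * 4 + ["Wyoming"] * 4,
    "County": ["Autauga County", "Baldwin County", "Blount County", "Calhoun County",
               "Natrona County", "Niobrara County", "Park County", "Sheridan County"],
    "POP": [54571, 182265, 57322, 118572, 75450, 2484, 28205, 29116],
})

# Step 1: collapse to one value per (STATE, County) pair.
per_county = df.groupby(["STATE", "County"])["POP"].sum()

# Step 2: regroup on the first index level (STATE) and keep the 3 largest
# values in each group. group_keys=False avoids adding STATE to the index
# a second time, so reset_index() yields plain STATE / County / POP columns.
top3 = (per_county.groupby(level=0, group_keys=False)
        .nlargest(3)
        .reset_index())

print(top3)
```

The first groupby only matters if a (STATE, County) pair can occur more than once; when every pair is already unique, the sum leaves the values unchanged.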

Answer 1 (score: 1)

Sort with DataFrame.sort_values on two columns, then take the first 3 rows of each state with GroupBy.head:

# If (STATE, County) pairs can repeat, aggregate to one row per pair first:
#df = df.groupby(['STATE','County'], as_index=False).sum()

df = df.sort_values(['STATE','POP'], ascending=[True, False]).groupby('STATE').head(3)
print(df)
          STATE             County     POP
2       Alabama     Baldwin County  182265
8       Alabama     Calhoun County  118572
5       Alabama      Blount County   57322
3164  Wisconsin    Waukesha County  389891
3167  Wisconsin   Winnebago County  166994
3163  Wisconsin  Washington County  131887
3182    Wyoming     Natrona County   75450
3186    Wyoming    Sheridan County   29116
3184    Wyoming        Park County   28205
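
The sort-then-head approach can be tried on a small frame as follows. This is a sketch under the assumption that each (STATE, County) pair appears only once; the sample rows are copied from the question and the variable name top3 is illustrative.

```python
import pandas as pd

# Assumed sample: a few rows copied from the question, not the full dataset.
df = pd.DataFrame({
    "STATE": ["Alabama"] * 4 + ["Wyoming"] * 4,
    "County": ["Autauga County", "Baldwin County", "Blount County", "Calhoun County",
               "Natrona County", "Niobrara County", "Park County", "Sheridan County"],
    "POP": [54571, 182265, 57322, 118572, 75450, 2484, 28205, 29116],
})

# Sort states ascending and population descending, so that the first rows
# of each STATE group are its most populous counties.
ordered = df.sort_values(["STATE", "POP"], ascending=[True, False])

# GroupBy.head(3) keeps the first 3 rows of every group and preserves
# the original row labels and all columns.
top3 = ordered.groupby("STATE").head(3)

print(top3)
```

Unlike the nlargest variant, this keeps the original row labels, which can be useful for joining the result back to the source frame.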

If you need a MultiIndex, add DataFrame.set_index:

df = (df.sort_values(['STATE','POP'], ascending=[True, False])
        .groupby('STATE')
        .head(3)
        .set_index(['STATE','County']))
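
One reason to prefer the MultiIndex form is that a state's rows can then be selected directly by label. The sketch below assumes the same small sample as the question's visible rows; the names indexed, wyoming and baldwin_pop are illustrative only.

```python
import pandas as pd

# Assumed sample: a few rows copied from the question, not the full dataset.
df = pd.DataFrame({
    "STATE": ["Alabama"] * 4 + ["Wyoming"] * 4,
    "County": ["Autauga County", "Baldwin County", "Blount County", "Calhoun County",
               "Natrona County", "Niobrara County", "Park County", "Sheridan County"],
    "POP": [54571, 182265, 57322, 118572, 75450, 2484, 28205, 29116],
})

# Top 3 counties per state, indexed by (STATE, County).
indexed = (df.sort_values(["STATE", "POP"], ascending=[True, False])
             .groupby("STATE")
             .head(3)
             .set_index(["STATE", "County"]))

# Select every top-3 row of one state by its first-level label.
wyoming = indexed.loc["Wyoming"]
print(wyoming)

# Select a single value with a full (STATE, County) key.
baldwin_pop = indexed.loc[("Alabama", "Baldwin County"), "POP"]
print(baldwin_pop)
```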