Pandas groupby aggregation based on multiple conditions

Time: 2020-02-11 12:55:40

Tags: pandas pandas-groupby

I have a data frame as shown below:

Sector    Plot    Year       Amount   Month
SE1       1       2017       10       Sep
SE1       1       2018       10       Oct
SE1       1       2019       10       Jun
SE1       1       2020       90       Feb
SE1       2       2018       50       Jan
SE1       2       2017       100      May
SE1       2       2018       30       Oct
SE2       2       2018       50       Mar
SE2       2       2019       100      Jan
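
To reproduce the problem, the table above can be built as a DataFrame roughly like this. It is a minimal sketch: the variable name `df` is the one used in the answer's code, and the dtypes (integers for Plot, Year and Amount, strings for Sector and Month) are assumed from how the table looks.

```python
import pandas as pd

# Sample data copied row by row from the table in the question.
df = pd.DataFrame({
    'Sector': ['SE1', 'SE1', 'SE1', 'SE1', 'SE1', 'SE1', 'SE1', 'SE2', 'SE2'],
    'Plot':   [1, 1, 1, 1, 2, 2, 2, 2, 2],
    'Year':   [2017, 2018, 2019, 2020, 2018, 2017, 2018, 2018, 2019],
    'Amount': [10, 10, 10, 90, 50, 100, 30, 50, 100],
    'Month':  ['Sep', 'Oct', 'Jun', 'Feb', 'Jan', 'May', 'Oct', 'Mar', 'Jan'],
})

print(df.shape)  # (9, 5)
```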

From the above, I would like to prepare the data frame below:

Sector    Plot      Number_of_Times    Mean_Amount    Recent_Amount   Recent_year  Recent_Month
SE1       1         4                  30             90              2020         Feb
SE1       2         3                  60             30              2018         Oct
SE2       2         2                  75             100             2019         Jan

1 answer:

Answer 0: (score: 1)

So, if all rows are already sorted as in the input data, use GroupBy.agg with named aggregation:

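# Named aggregation: each keyword is an output column built from (input column, function).
# 'last' takes the final row of each group, so this relies on the rows being in chronological order.
df1 = (df.groupby(['Sector', 'Plot'])
         .agg(Number_of_Times=('Year', 'size'),
              Mean_Amount=('Amount', 'mean'),
              Recent_Amount=('Amount', 'last'),
              Recent_year=('Year', 'last'),
              Recent_Month=('Month', 'last'))
         .reset_index())
print(df1)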
  Sector  Plot  Number_of_Times  Mean_Amount  Recent_Amount  Recent_year  \
0    SE1     1                4           30             90         2020   
1    SE1     2                3           60             30         2018   
2    SE2     2                2           75            100         2019   

  Recent_Month  
0          Feb  
1          Oct  
2          Jan  
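
The "if all rows are sorted" condition matters because `'last'` simply returns whichever row comes last within each group. The sketch below illustrates this on the question's sample data by reversing the row order; the names `keys`, `as_given` and `reversed_rows` are only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'Sector': ['SE1', 'SE1', 'SE1', 'SE1', 'SE1', 'SE1', 'SE1', 'SE2', 'SE2'],
    'Plot':   [1, 1, 1, 1, 2, 2, 2, 2, 2],
    'Year':   [2017, 2018, 2019, 2020, 2018, 2017, 2018, 2018, 2019],
    'Amount': [10, 10, 10, 90, 50, 100, 30, 50, 100],
    'Month':  ['Sep', 'Oct', 'Jun', 'Feb', 'Jan', 'May', 'Oct', 'Mar', 'Jan'],
})

keys = ['Sector', 'Plot']

# Rows in the order given: 'last' picks the final row of each group.
as_given = df.groupby(keys)['Amount'].last()

# Same rows in reverse order: 'last' now picks what used to be the first row.
reversed_rows = df.iloc[::-1].groupby(keys)['Amount'].last()

print(as_given.tolist())       # [90, 30, 100]
print(reversed_rows.tolist())  # [10, 50, 50]
```

If the row order cannot be trusted, sorting before grouping is the safer route.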

If necessary, convert Month to datetime, add DataFrame.sort_values, apply the solution, and finally convert the month back to a string:

import pandas as pd

# Parse month abbreviations so they sort in calendar order rather than alphabetically.
df['Month'] = pd.to_datetime(df['Month'], format='%b')

df1 = (df.sort_values(['Sector', 'Plot', 'Year', 'Month'])
         .groupby(['Sector', 'Plot'])
         .agg(Number_of_Times=('Year', 'size'),
              Mean_Amount=('Amount', 'mean'),
              Recent_Amount=('Amount', 'last'),
              Recent_year=('Year', 'last'),
              Recent_Month=('Month', 'last'))
         .reset_index())
# Convert the datetime values back to month abbreviations.
df1['Recent_Month'] = df1['Recent_Month'].dt.strftime('%b')
print(df1)
  Sector  Plot  Number_of_Times  Mean_Amount  Recent_Amount  Recent_year  \
0    SE1     1                4           30             90         2020   
1    SE1     2                3           60             30         2018   
2    SE2     2                2           75            100         2019   

  Recent_Month  
0          Feb  
1          Oct  
2          Jan  
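
The reason for the datetime round trip is that month abbreviations sort alphabetically as plain strings. A small sketch of the difference, using a made-up three-element Series `s` (pandas fills in a default year and day when only `%b` is parsed, and `%b` assumes English month names in the current locale):

```python
import pandas as pd

s = pd.Series(['Sep', 'Feb', 'Oct'])

# As strings, the months sort alphabetically, which is not calendar order.
print(sorted(s))  # ['Feb', 'Oct', 'Sep']

# Parsed with format='%b', they become datetimes that sort by calendar month.
as_dates = pd.to_datetime(s, format='%b')
print(as_dates.dt.month.tolist())  # [9, 2, 10]

# After sorting, strftime('%b') turns them back into abbreviations.
calendar_order = as_dates.sort_values().dt.strftime('%b').tolist()
print(calendar_order)  # ['Feb', 'Sep', 'Oct']
```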

Another idea, which is buggy in pandas 0.25.1:

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
df['Month'] = pd.Categorical(df['Month'], ordered=True, categories=months)

df1 = (df.sort_values(['Sector', 'Plot', 'Year', 'Month'])
         .groupby(['Sector', 'Plot'])
         .agg(Number_of_Times=('Year', 'size'),
              Mean_Amount=('Amount', 'mean'),
              Recent_Amount=('Amount', 'last'),
              Recent_year=('Year', 'last'),
              Recent_Month=('Month', 'last'))
         .reset_index())

print(df1)

ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long long'
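
One possible way around the categorical error, not part of the original answer, is to keep Month as plain strings and sort on a temporary numeric helper column instead. This is only a sketch: the column name `Month_num` and the `month_num` lookup dict are made up for illustration, and the data is the sample from the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'Sector': ['SE1', 'SE1', 'SE1', 'SE1', 'SE1', 'SE1', 'SE1', 'SE2', 'SE2'],
    'Plot':   [1, 1, 1, 1, 2, 2, 2, 2, 2],
    'Year':   [2017, 2018, 2019, 2020, 2018, 2017, 2018, 2018, 2019],
    'Amount': [10, 10, 10, 90, 50, 100, 30, 50, 100],
    'Month':  ['Sep', 'Oct', 'Jun', 'Feb', 'Jan', 'May', 'Oct', 'Mar', 'Jan'],
})

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
# Hypothetical lookup: month abbreviation -> 1..12.
month_num = {m: i for i, m in enumerate(months, start=1)}

df1 = (df.assign(Month_num=df['Month'].map(month_num))  # helper column, used only for sorting
         .sort_values(['Sector', 'Plot', 'Year', 'Month_num'])
         .groupby(['Sector', 'Plot'])
         .agg(Number_of_Times=('Year', 'size'),
              Mean_Amount=('Amount', 'mean'),
              Recent_Amount=('Amount', 'last'),
              Recent_year=('Year', 'last'),
              Recent_Month=('Month', 'last'))  # Month stays a plain string column
         .reset_index())

print(df1)
```

Because the helper column is never aggregated, it does not appear in the result, and no conversion back to strings is needed afterwards.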