How can I group ranges of column values in a pandas DataFrame into contiguous bins using the "groupby" and "cut" methods?

Time: 2019-12-28 13:32:28

Tags: python pandas dataframe range backend

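I have a pandas DataFrame, shown below, with the minimum, maximum, and average sales of the petroleum product Light Diesel Oil. From it I want to build a DataFrame that shows the minimum, maximum, and average sales over 5-year intervals (for example 2010-2014, 2015-2019, and so on), with the end year included in each interval.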

Assume the DataFrame below is named "lightdiesel_df":

   petroleum_product  year  max_sale  min_sale  avg_sale
0   Light Diesel Oil  2014         0         0       0.0
1   Light Diesel Oil  2013         0         0       0.0
2   Light Diesel Oil  2012       258       258     258.0
3   Light Diesel Oil  2011         0         0       0.0
4   Light Diesel Oil  2010       227       227     227.0
5   Light Diesel Oil  2009       238       238     238.0
6   Light Diesel Oil  2008       377       377     377.0
7   Light Diesel Oil  2007       306       306     306.0
8   Light Diesel Oil  2006       179       179     179.0
9   Light Diesel Oil  2005       290       290     290.0
10  Light Diesel Oil  2004        88        88      88.0
11  Light Diesel Oil  2003       577       577     577.0
12  Light Diesel Oil  2002       610       610     610.0
13  Light Diesel Oil  2001      2413      2413    2413.0
14  Light Diesel Oil  2000      3416      3416    3416.0

So basically, I would like the following output:

petroleum_product   year       min_sale  max_sale  avg_sale
Light Diesel Oil    2010-2014       227       258     242.5
Light Diesel Oil    2005-2009       179       377     278
Light Diesel Oil    2000-2004        88      3416    1420.8
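One detail worth noting: the expected 2010-2014 row (min 227, avg 242.5) only matches the sample data if the zero-sale years are left out, since (258 + 227) / 2 = 242.5. The sketch below assumes that a sale of 0 means "no data" and masks those values before aggregating. That reading is an assumption, not something the question states, and the names `masked` and `expected_like` are illustrative.

```python
import pandas as pd

# Rebuild the sample data from the question (years 2014 down to 2000).
sales = [0, 0, 258, 0, 227, 238, 377, 306, 179, 290, 88, 577, 610, 2413, 3416]
lightdiesel_df = pd.DataFrame({
    "petroleum_product": "Light Diesel Oil",
    "year": list(range(2014, 1999, -1)),
    "max_sale": sales,
    "min_sale": sales,
    "avg_sale": [float(s) for s in sales],
})

# Assumption: 0 means "missing". Replace zeros with NaN so that
# min/max/mean skip them (pandas aggregations ignore NaN by default).
masked = lightdiesel_df.copy()
for col in ["max_sale", "min_sale", "avg_sale"]:
    masked[col] = masked[col].where(masked[col] != 0)

# Left-closed 5-year bins: [2000, 2005), [2005, 2010), [2010, 2015).
bins = [2000, 2005, 2010, 2015]
labels = ["2000-2004", "2005-2009", "2010-2014"]
masked["year"] = pd.cut(masked["year"], bins, labels=labels, right=False)

expected_like = (
    masked.groupby(["petroleum_product", "year"], sort=False,
                   observed=True, as_index=False)
          .agg({"min_sale": "min", "max_sale": "max", "avg_sale": "mean"})
)
print(expected_like)
```

If zeros are real sales figures rather than gaps, skip the masking step; the 2010-2014 row then has a minimum of 0 and an average of 97.0.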

3 answers:

Answer 0: (score: 2)

Try using pd.Grouper with a 5-year frequency and the parameter closed='left', as follows:

# Convert the integer year to a datetime so that pd.Grouper can bin it by frequency
df2['year'] = pd.to_datetime(df2['year'], format = '%Y')

(df2.groupby(['petroleum_product', pd.Grouper(key = 'year', freq = '5Y', closed = 'left')])
    .agg(
      {'year': lambda x: '-'.join((str(min(x.dt.year)), str(max(x.dt.year)))),
      'max_sale' : 'max',
      'min_sale' : 'min',
      'avg_sale' : 'mean'
    }).reset_index(level= 0).reset_index(drop=True)
)
#output:

    petroleum_product   year        max_sale    min_sale    avg_sale
0   Light Diesel Oil    2000-2004   3416        88          1420.8
1   Light Diesel Oil    2005-2009   377         179         278.0
2   Light Diesel Oil    2010-2014   258         0           97.0

Answer 1: (score: 1)

You can also try pd.cut, after building bins from the year column and labels formatted to match the expected output:

bins=[*range(df['year'].min(),df['year'].max()+5)][::5]
#output : [2000, 2005, 2010, 2015]
labels=[f"{a}-{b-1}" for a,b in zip(bins,bins[1::])]
#output: ['2000-2004', '2005-2009', '2010-2014']
s=pd.cut(df['year'],bins,labels=labels,include_lowest=True,right=False)

final=(df.assign(year=s).groupby(['petroleum_product','year'],sort=False,as_index=False)
 .agg({'max_sale':'max', 'min_sale':'min','avg_sale':'mean'}))

  petroleum_product       year  max_sale  min_sale  avg_sale
0  Light Diesel Oil  2010-2014       258         0      97.0
1  Light Diesel Oil  2005-2009       377       179     278.0
2  Light Diesel Oil  2000-2004      3416        88    1420.8
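The reason for right=False in the pd.cut call is that the labels describe left-closed ranges such as 2005-2009. The following sketch, using a handful of boundary years chosen for illustration, compares the two settings. With the default right-closed intervals, the years 2005 and 2010 end up under the previous label. With right=False the include_lowest=True argument appears to be redundant, because the first interval is already closed on the left.

```python
import pandas as pd

# Boundary years picked to show where each setting places the bin edges.
years = pd.Series([2000, 2004, 2005, 2009, 2010, 2014])
bins = [2000, 2005, 2010, 2015]
labels = ["2000-2004", "2005-2009", "2010-2014"]

# Left-closed intervals: [2000, 2005), [2005, 2010), [2010, 2015)
left_closed = pd.cut(years, bins, labels=labels, right=False)

# Default right-closed intervals: [2000, 2005], (2005, 2010], (2010, 2015]
# (include_lowest=True keeps 2000 itself inside the first bin)
right_closed = pd.cut(years, bins, labels=labels, include_lowest=True)

comparison = pd.DataFrame({
    "year": years,
    "right=False": left_closed.astype(str),
    "right=True": right_closed.astype(str),
})
print(comparison)
```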

Answer 2: (score: 0)

Try this. pd.cut splits the DataFrame into the specified ranges:

df['year_range']=pd.cut(df.year, [1999,2004,2009,2015])

df_res=(df.groupby(['petroleum_product','year_range'])
 .agg({'max_sale':'max', 'min_sale':'min','avg_sale':'mean'}))
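This answer does not show its result, so here is a self-contained sketch of the same idea run against the sample data from the question. Without a labels argument, pd.cut labels each group with a right-closed interval such as (1999, 2004]. The observed=True argument is an addition here, used so that only bins that actually contain rows are returned.

```python
import pandas as pd

# Rebuild the sample data from the question (years 2014 down to 2000).
sales = [0, 0, 258, 0, 227, 238, 377, 306, 179, 290, 88, 577, 610, 2413, 3416]
df = pd.DataFrame({
    "petroleum_product": "Light Diesel Oil",
    "year": list(range(2014, 1999, -1)),
    "max_sale": sales,
    "min_sale": sales,
    "avg_sale": [float(s) for s in sales],
})

# Right-closed bins: (1999, 2004], (2004, 2009], (2009, 2015]
df["year_range"] = pd.cut(df["year"], [1999, 2004, 2009, 2015])

df_res = (
    df.groupby(["petroleum_product", "year_range"], observed=True)
      .agg({"max_sale": "max", "min_sale": "min", "avg_sale": "mean"})
)
print(df_res)
```

To get labels in the "2000-2004" style instead of interval notation, pass a labels list to pd.cut with one entry per bin.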