Question

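I'm having trouble applying sort_values() and cumsum() within groups.

I have a dataset (built in the code below).

Basically, I need to sort the values within each group, compute the cumulative sales, and then select the rows that make up 90% of sales.

First, obtain the cumulative sales within each region, sorted in descending order,

then select the rows that account for 90% of sales within each region.

I tried the following, but the last line does not work. It raises the error: Cannot access callable attribute 'sort_values' of 'SeriesGroupBy' objects, try using the 'apply' method

I also tried apply.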

import pandas as pd
df = pd.DataFrame({'id':['id_1', 
'id_2','id_3','id_4','id_5','id_6','id_7','id_8', 'id_1', 
'id_2','id_3','id_4','id_5','id_6','id_7','id_8'],
               'region':[1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,],
               'sales':[54,34,23,56,78,98,76,34,27,89,76,54,34,45,56,54]})
df['%']=df['sales']/df.groupby(df['region'])['sales'].transform('sum')
df['cumul'] = df.groupby(df['region'])['sales'].sort_values(ascending=False).cumsum()

Thanks for any suggestions.

Answer 1

You can simply sort the dataframe first and then do the groupby():

df.sort_values(['region','sales'], ascending=[True,False],inplace=True)

df['%']=df['sales']/df.groupby(df['region'])['sales'].transform('sum')

df['cummul'] = df.groupby('region')['%'].cumsum()

# filter
df[df['cummul'].le(0.9)]

Output:

      id  region  sales         %    cummul
5   id_6       1     98  0.216336  0.216336
4   id_5       1     78  0.172185  0.388521
6   id_7       1     76  0.167770  0.556291
3   id_4       1     56  0.123620  0.679912
0   id_1       1     54  0.119205  0.799117
1   id_2       1     34  0.075055  0.874172
9   id_2       2     89  0.204598  0.204598
10  id_3       2     76  0.174713  0.379310
14  id_7       2     56  0.128736  0.508046
11  id_4       2     54  0.124138  0.632184
15  id_8       2     54  0.124138  0.756322
13  id_6       2     45  0.103448  0.859770
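
The same steps can also be written without modifying df in place. The following is only a rough sketch under my own assumptions: the helper name top_share, its parameters, and the cum_share column are made up for illustration and are not part of the answer above.

```python
import pandas as pd

def top_share(df, group_col="region", value_col="sales", threshold=0.9):
    """Keep rows whose within-group cumulative share (largest first) is <= threshold.

    Hypothetical helper for illustration; returns a new frame and leaves df untouched.
    """
    # Largest values first inside each group.
    ordered = df.sort_values([group_col, value_col], ascending=[True, False])
    grouped = ordered.groupby(group_col)[value_col]
    # Running total divided by the group total gives the cumulative share.
    share = grouped.cumsum() / grouped.transform("sum")
    return ordered.assign(cum_share=share)[share.le(threshold)]

df = pd.DataFrame({
    "id": ["id_1", "id_2", "id_3", "id_4", "id_5", "id_6", "id_7", "id_8"] * 2,
    "region": [1] * 8 + [2] * 8,
    "sales": [54, 34, 23, 56, 78, 98, 76, 34, 27, 89, 76, 54, 34, 45, 56, 54],
})

result = top_share(df)
print(result)
```

As in the answer, the row that pushes a region past 90% is excluded, because only rows with a cumulative share of at most 0.9 are kept.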

Answer 2

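First, we use your logic to create the % column, but multiply by 100 and round to whole numbers.

Then we sort by region and %; no groupby is needed for this step.

After sorting, we create the cumul column.

Finally, we use query to select the rows that fall within the 90% range: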

df['%'] = df['sales'].div(df.groupby('region')['sales'].transform('sum')).mul(100).round()
df = df.sort_values(['region', '%'], ascending=[True, False])
df['cumul'] = df.groupby('region')['%'].cumsum()

df.query('cumul.le(90)')

Output:

      id  region  sales     %  cumul
5   id_6       1     98  22.0   22.0
4   id_5       1     78  17.0   39.0
6   id_7       1     76  17.0   56.0
0   id_1       1     54  12.0   68.0
3   id_4       1     56  12.0   80.0
1   id_2       1     34   8.0   88.0
9   id_2       2     89  20.0   20.0
10  id_3       2     76  17.0   37.0
14  id_7       2     56  13.0   50.0
11  id_4       2     54  12.0   62.0
15  id_8       2     54  12.0   74.0
13  id_6       2     45  10.0   84.0
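
One thing to keep in mind with this variant: because the percentages are rounded before they are accumulated, the running total can drift away from the exact one. A possible alternative, sketched here under my own assumptions (the column name pct and the variables rounded_total, exact_total and selected are illustrative), is to accumulate the exact percentages and round only for display.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["id_1", "id_2", "id_3", "id_4", "id_5", "id_6", "id_7", "id_8"] * 2,
    "region": [1] * 8 + [2] * 8,
    "sales": [54, 34, 23, 56, 78, 98, 76, 34, 27, 89, 76, 54, 34, 45, 56, 54],
})

# Exact (unrounded) percentage of each row within its region.
region_total = df.groupby("region")["sales"].transform("sum")
out = (
    df.assign(pct=df["sales"] / region_total * 100)
      .sort_values(["region", "pct"], ascending=[True, False])
)
out["cumul"] = out.groupby("region")["pct"].cumsum()

# Compare: sum of rounded percentages vs. sum of exact percentages per region.
rounded_total = out["pct"].round().groupby(out["region"]).sum()
exact_total = out.groupby("region")["pct"].sum()
print(rounded_total)
print(exact_total)

# Filter on the exact cumulative value, round afterwards for readability.
selected = out[out["cumul"] <= 90].round({"pct": 1, "cumul": 1})
print(selected)
```

With the sample data, the rounded percentages add up to 101 in region 1 and 98 in region 2, while the exact ones add up to 100 in both. The selected rows happen to be the same here, but that is not guaranteed for other data.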

Answer 3

If you only need the sales data without the percentages, this can be done easily with method chaining:

(
  df
  .sort_values(by='sales', ascending=False)
  .groupby('region')
  # keep the rows above each region's 10th percentile of sales
  .apply(lambda x: x[x.sales > x.sales.quantile(.1)])
  .reset_index(level=0, drop=True)
)

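This works because keeping every value above the 10th percentile is essentially the same as keeping the top 90% of values by rank. Note that this selects rows by rank within each region, so it approximates a cutoff at 90% of total sales rather than reproducing it exactly.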
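
To see how the quantile rule compares with a cutoff on cumulative sales share, here is a small side-by-side sketch on the question's data. The variable names (cutoff, by_quantile, by_share) are my own, and transform is used instead of apply purely as a convenience for this comparison.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["id_1", "id_2", "id_3", "id_4", "id_5", "id_6", "id_7", "id_8"] * 2,
    "region": [1] * 8 + [2] * 8,
    "sales": [54, 34, 23, 56, 78, 98, 76, 34, 27, 89, 76, 54, 34, 45, 56, 54],
})

# Rule 1: keep rows above the 10th percentile of sales in their region.
cutoff = df.groupby("region")["sales"].transform(lambda s: s.quantile(0.1))
by_quantile = df[df["sales"] > cutoff]

# Rule 2: keep rows while the descending cumulative share stays <= 90%.
ordered = df.sort_values(["region", "sales"], ascending=[True, False])
grouped = ordered.groupby("region")["sales"]
cum_share = grouped.cumsum() / grouped.transform("sum")
by_share = ordered[cum_share <= 0.9]

print(by_quantile.groupby("region").size())
print(by_share.groupby("region").size())
```

On this data the quantile rule keeps 7 rows per region and the cumulative-share rule keeps 6, so which one to use depends on whether "90%" refers to rows or to sales volume.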
