Pandas: top N and the sum of the rest, for each group

Date: 2018-11-27 13:35:07

Tags: python pandas dataframe

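I have a DataFrame with Country, Region, City, Product and Sales (in USD). For each Country, Region and City I need the top 3 products, with all remaining products grouped under "Other", together with the associated sales and units.

The final result should be the top 3 products plus "Other" for each Country, Region and City combination: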

Country Region City Product Sales Val
Europe Italy Milan Pen 8200 5 
Europe Italy Milan Phone 1500 10 
Europe Italy Milan Book 300 5 
Europe Italy Milan Other 400 25

Result for the top 3:

{{1}}

1 Answer:

Answer 0 (score: 1)

First, create a default index with reset_index, because the later steps select rows by index label:

df = df.reset_index(drop=True)

Then sort by the Sales column with sort_values and take the first 3 rows of each group with GroupBy.head:

cols = ['Country','Region', 'City']
df1 = df.sort_values('Sales', ascending=False).groupby(cols).head(3)
print (df1)
  Country Region   City Product  Sales  Val
5  Europe  Italy  Milan     Pen   8200    5
2  Europe  Italy  Milan   Phone   1500   10
1  Europe  Italy  Milan    Book    300    5

Then filter out the rows already used for the top 3 and aggregate the remaining rows with sum:

df2 = df.loc[df.index.difference(df1.index)]
# sum only the numeric columns so the string Product column is not aggregated
df2 = df2.groupby(cols, as_index=False)[['Sales', 'Val']].sum().assign(Product='Other')
print (df2)
  Country Region   City  Sales  Val Product
0  Europe  Italy  Milan    400   25   Other
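The reset_index step matters because index.difference works on index labels, not row positions. The following is a small sketch on made-up data (the frame, its values and the variable names are illustrative assumptions, not the asker's data) showing how duplicate labels can silently drop rows from the remainder:

```python
import pandas as pd

# Hypothetical frame with repeated index labels, e.g. after concatenating two frames
df = pd.DataFrame(
    {"City": ["Milan", "Milan", "Rome", "Rome"], "Sales": [10, 5, 7, 3]},
    index=[0, 1, 1, 0],
)

# Top 1 row per city carries the labels 0 (Milan) and 1 (Rome)
top = df.sort_values("Sales", ascending=False).groupby("City").head(1)

# Every label is "used", so the remainder comes out empty even though two rows are left over
rest_bad = df.loc[df.index.difference(top.index)]

# With a default RangeIndex every row has its own label and the remainder is correct
clean = df.reset_index(drop=True)
top_clean = clean.sort_values("Sales", ascending=False).groupby("City").head(1)
rest_ok = clean.loc[clean.index.difference(top_clean.index)]

print(len(rest_bad), len(rest_ok))
```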

Finally, join the two parts together with concat:

df = pd.concat([df1, df2]).sort_values(cols).reset_index(drop=True)
print (df)
    City Country Product Region  Sales  Val
0  Milan  Europe     Pen  Italy   8200    5
1  Milan  Europe   Phone  Italy   1500   10
2  Milan  Europe    Book  Italy    300    5
3  Milan  Europe   Other  Italy    400   25
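The steps above can be put together into one helper. This is a minimal sketch rather than the answer's exact code: the function name top_n_with_other, its parameters, the temporary _is_other column and the sample frame are assumptions made for illustration. It orders each group by descending sales with the "Other" row last instead of relying on the sort order of the concatenated frame.

```python
import pandas as pd

def top_n_with_other(df, group_cols, name_col="Product", value_col="Sales",
                     n=3, label="Other"):
    """Keep the n largest rows per group; sum the remaining rows into one `label` row."""
    df = df.reset_index(drop=True)  # unique labels, so index.difference is safe
    top = df.sort_values(value_col, ascending=False).groupby(group_cols).head(n)
    rest = df.loc[df.index.difference(top.index)]
    # Sum only the numeric columns that are not grouping keys
    num_cols = [c for c in df.select_dtypes(include="number").columns
                if c not in group_cols]
    other = rest.groupby(group_cols, as_index=False)[num_cols].sum()
    other[name_col] = label
    # Helper flag so the "Other" row sorts after the top rows of its group
    out = pd.concat([top.assign(_is_other=False), other.assign(_is_other=True)])
    order = group_cols + ["_is_other", value_col]
    ascending = [True] * len(group_cols) + [True, False]
    return (out.sort_values(order, ascending=ascending)
               .drop(columns="_is_other")
               .reset_index(drop=True))

# Illustrative sample data
sample = pd.DataFrame({
    "Country": ["Europe"] * 12,
    "Region": ["Italy"] * 12,
    "City": ["Milan"] * 5 + ["Rome"] * 7,
    "Product": ["Ring", "Book", "Phone", "Car", "Ring", "Pen",
                "Ring", "Book", "Phone", "Car", "Ring", "Pencil"],
    "Sales": [100, 300, 1500, 200, 100, 8200, 100, 300, 1500, 200, 100, 8100],
    "Val": [10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5],
})
result = top_n_with_other(sample, ["Country", "Region", "City"])
print(result)
```

Groups with n rows or fewer simply get no "Other" row, because nothing is left to aggregate.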

Another solution:

print (df)
   Country Region   City Product  Sales  Val
0   Europe  Italy  Milan    Ring    100   10
1   Europe  Italy  Milan    Book    300    5
2   Europe  Italy  Milan   Phone   1500   10
3   Europe  Italy  Milan     Car    200    5
4   Europe  Italy  Milan    Ring    100   10
5   Europe  Italy   Rome     Pen   8200    5
6   Europe  Italy   Rome    Ring    100   10
7   Europe  Italy   Rome    Book    300    5
8   Europe  Italy   Rome   Phone   1500   10
9   Europe  Italy   Rome     Car    200    5
10  Europe  Italy   Rome    Ring    100   10
11  Europe  Italy   Rome  Pencil   8100    5

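The idea is to sort the values by Sales, create a per-group counter column with cumcount, and replace the Product value of every row outside the top 3 with Other: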

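import numpy as np

cols = ['Country','Region', 'City']
# counter per group in descending Sales order, capped at 3 so all remaining rows share one value
df['g'] = df.sort_values('Sales', ascending=False).groupby(cols).cumcount().clip(upper=3)
df['Product'] = np.where(df['g'] >= 3 , 'Other', df['Product'])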
print (df)
   Country Region   City Product  Sales  Val  g
0   Europe  Italy  Milan   Other    100   10  3
1   Europe  Italy  Milan    Book    300    5  1
2   Europe  Italy  Milan   Phone   1500   10  0
3   Europe  Italy  Milan     Car    200    5  2
4   Europe  Italy  Milan   Other    100   10  3
5   Europe  Italy   Rome     Pen   8200    5  0
6   Europe  Italy   Rome   Other    100   10  3
7   Europe  Italy   Rome   Other    300    5  3
8   Europe  Italy   Rome   Phone   1500   10  2
9   Europe  Italy   Rome   Other    200    5  3
10  Europe  Italy   Rome   Other    100   10  3
11  Europe  Italy   Rome  Pencil   8100    5  1

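Then aggregate with sum, sort by the counter so that Other comes last in each group, and drop the helper column: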

df2 = (df.groupby(cols + ['Product'], as_index=False).sum()
         .sort_values(cols + ['g'])
         .drop('g', axis=1)
         .reset_index(drop=True))
print (df2)
  Country Region   City Product  Sales  Val
0  Europe  Italy  Milan   Phone   1500   10
1  Europe  Italy  Milan    Book    300    5
2  Europe  Italy  Milan     Car    200    5
3  Europe  Italy  Milan   Other    200   20
4  Europe  Italy   Rome     Pen   8200    5
5  Europe  Italy   Rome  Pencil   8100    5
6  Europe  Italy   Rome   Phone   1500   10
7  Europe  Italy   Rome   Other    700   30
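The counter-based approach can also be sketched as a reusable function for any N. The function name top_n_by_counter, its parameters, the _rank helper column and the sample frame are assumptions for illustration. Unlike the code above, this variant puts the capped counter into the group keys, so top rows are never merged even if two of them share a product name, and the default group sorting already places "Other" last.

```python
import numpy as np
import pandas as pd

def top_n_by_counter(df, group_cols, name_col="Product", value_col="Sales",
                     n=3, label="Other"):
    """Label every row beyond the n largest per group as `label`, then sum per label."""
    out = df.reset_index(drop=True)  # unique labels so the counter aligns row by row
    # Stable sort keeps tied rows in input order, which makes the counter deterministic
    counter = (out.sort_values(value_col, ascending=False, kind="mergesort")
                  .groupby(group_cols).cumcount())
    out["_rank"] = counter.clip(upper=n)  # 0..n-1 for the top rows, n for everything else
    out[name_col] = np.where(out["_rank"] >= n, label, out[name_col])
    # Grouping sorts by the keys, so rows come out ordered by group and then by rank
    return (out.groupby(group_cols + ["_rank", name_col], as_index=False)
               .sum(numeric_only=True)
               .drop(columns="_rank"))

# Illustrative sample data
sample = pd.DataFrame({
    "Country": ["Europe"] * 12,
    "Region": ["Italy"] * 12,
    "City": ["Milan"] * 5 + ["Rome"] * 7,
    "Product": ["Ring", "Book", "Phone", "Car", "Ring", "Pen",
                "Ring", "Book", "Phone", "Car", "Ring", "Pencil"],
    "Sales": [100, 300, 1500, 200, 100, 8200, 100, 300, 1500, 200, 100, 8100],
    "Val": [10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5],
})
summary = top_n_by_counter(sample, ["Country", "Region", "City"], n=3)
print(summary)
```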