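I have a dataframe with country, region, city, product and sales (in US dollars). For each country, region and city I need the top 3 products, with the remaining products grouped under "Other", together with the associated sales and units.
The final result is the top 3 products plus "Other" for each combination of country, region and city.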
Country Region City Product Sales Val
Europe Italy Milan Pen 8200 5
Europe Italy Milan Phone 1500 10
Europe Italy Milan Book 300 5
Europe Italy Milan Other 400 25
Result for the top 3:
{{1}}
Answer 0 (score: 1)
First, create a default index with reset_index:
df = df.reset_index(drop=True)
Then sort by the Sales column with sort_values and take the first 3 rows of each group with GroupBy.head:
cols = ['Country','Region', 'City']
df1 = df.sort_values('Sales', ascending=False).groupby(cols).head(3)
print (df1)
Country Region City Product Sales Val
5 Europe Italy Milan Pen 8200 5
2 Europe Italy Milan Phone 1500 10
1 Europe Italy Milan Book 300 5
Then filter out the rows already used for the top 3 and aggregate the rest with sum:
df2 = df.loc[df.index.difference(df1.index)]
df2 = df2.groupby(cols, as_index=False).sum().assign(Product='Other')
print (df2)
Country Region City Sales Val Product
0 Europe Italy Milan 400 25 Other
Finally, join both parts together with concat:
df = pd.concat([df1, df2]).sort_values(cols).reset_index(drop=True)
print (df)
City Country Product Region Sales Val
0 Milan Europe Pen Italy 8200 5
1 Milan Europe Phone Italy 1500 10
2 Milan Europe Book Italy 300 5
3 Milan Europe Other Italy 400 25
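The steps above can be collected into one helper. This is only a sketch under a few assumptions: the function name `top_n_with_other`, its parameters and the temporary `_is_other` column are made up for illustration, and the sample frame reuses the 12-row data from the second solution. Instead of relying on the row order left by concat, it sorts explicitly so that "Other" comes last in each group.

```python
import pandas as pd

def top_n_with_other(df, group_cols, n=3, value_col="Sales", label_col="Product"):
    """Keep the n largest rows per group and sum the remaining rows into 'Other'."""
    df = df.reset_index(drop=True)  # unique index, needed for index.difference
    top = df.sort_values(value_col, ascending=False).groupby(group_cols).head(n)
    rest = df.loc[df.index.difference(top.index)]
    other = (rest.groupby(group_cols, as_index=False)
                 .sum(numeric_only=True)          # drop the text column before summing
                 .assign(**{label_col: "Other"}))
    # hypothetical helper column: 0 for top rows, 1 for the 'Other' row
    out = pd.concat([top.assign(_is_other=0), other.assign(_is_other=1)], sort=False)
    out = out.sort_values(group_cols + ["_is_other", value_col],
                          ascending=[True] * len(group_cols) + [True, False])
    return out.drop(columns="_is_other").reset_index(drop=True)

sample = pd.DataFrame({
    "Country": ["Europe"] * 12,
    "Region": ["Italy"] * 12,
    "City": ["Milan"] * 5 + ["Rome"] * 7,
    "Product": ["Ring", "Book", "Phone", "Car", "Ring",
                "Pen", "Ring", "Book", "Phone", "Car", "Ring", "Pencil"],
    "Sales": [100, 300, 1500, 200, 100, 8200, 100, 300, 1500, 200, 100, 8100],
    "Val": [10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5],
})
result = top_n_with_other(sample, ["Country", "Region", "City"])
print(result)
```

If two products tie exactly at the cut-off, which of them lands in the top n is not specified by this sketch.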
Another solution:
print (df)
Country Region City Product Sales Val
0 Europe Italy Milan Ring 100 10
1 Europe Italy Milan Book 300 5
2 Europe Italy Milan Phone 1500 10
3 Europe Italy Milan Car 200 5
4 Europe Italy Milan Ring 100 10
5 Europe Italy Rome Pen 8200 5
6 Europe Italy Rome Ring 100 10
7 Europe Italy Rome Book 300 5
8 Europe Italy Rome Phone 1500 10
9 Europe Italy Rome Car 200 5
10 Europe Italy Rome Ring 100 10
11 Europe Italy Rome Pencil 8100 5
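The idea is to sort the values by Sales, create a per-group counter column with cumcount, and replace the Product values with Other where the counter is 3 or more: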
cols = ['Country','Region', 'City']
df['g'] = df.sort_values('Sales', ascending=False).groupby(cols).cumcount()
df['Product'] = np.where(df['g'] >= 3 , 'Other', df['Product'])
print (df)
Country Region City Product Sales Val g
0 Europe Italy Milan Other 100 10 3
1 Europe Italy Milan Book 300 5 1
2 Europe Italy Milan Phone 1500 10 0
3 Europe Italy Milan Car 200 5 2
4 Europe Italy Milan Other 100 10 3
5 Europe Italy Rome Pen 8200 5 0
6 Europe Italy Rome Other 100 10 3
7 Europe Italy Rome Other 300 5 3
8 Europe Italy Rome Phone 1500 10 2
9 Europe Italy Rome Other 200 5 3
10 Europe Italy Rome Other 100 10 3
11 Europe Italy Rome Pencil 8100 5 1
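Assigning the counter back works because cumcount keeps the original index labels, so pandas aligns each value with its row even though the counter was computed on a sorted copy. A tiny sketch with made-up data (the four rows below are not from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["Milan", "Milan", "Rome", "Milan"],
    "Sales": [100, 300, 50, 200],
})
# Number the rows of each city from the largest Sales down (0 = best seller).
# The resulting Series is in sorted order but still carries index labels 0..3.
rank = df.sort_values("Sales", ascending=False).groupby("City").cumcount()
# Column assignment aligns on those labels, not on position.
df["g"] = rank
print(df)
```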
Then aggregate with sum:
df2 = (df.groupby(cols + ['Product'], as_index=False).sum()
.sort_values(cols + ['g'])
.drop('g', axis=1)
.reset_index(drop=True))
print (df2)
Country Region City Product Sales Val
0 Europe Italy Milan Phone 1500 10
1 Europe Italy Milan Book 300 5
2 Europe Italy Milan Car 200 5
3 Europe Italy Milan Other 200 20
4 Europe Italy Rome Pen 8200 5
5 Europe Italy Rome Pencil 8100 5
6 Europe Italy Rome Phone 1500 10
7 Europe Italy Rome Other 700 30
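For reuse, the second approach could be wrapped as below. This is a sketch rather than the answer's exact code: the function name `top_n_relabel` and its parameters are hypothetical, it works on a copy instead of modifying the input, and it caps the counter with clip so every "Other" row carries the same value. It assumes a product appears at most once among a group's top n rows.

```python
import numpy as np
import pandas as pd

def top_n_relabel(df, group_cols, n=3, value_col="Sales", label_col="Product"):
    """Relabel rows past the n best sellers per group as 'Other', then sum."""
    out = df.copy()  # leave the caller's frame untouched
    # 0-based position inside each group, largest sales first, capped at n
    out["g"] = (out.sort_values(value_col, ascending=False)
                   .groupby(group_cols).cumcount()
                   .clip(upper=n))
    out[label_col] = np.where(out["g"] >= n, "Other", out[label_col])
    summed = out.groupby(group_cols + [label_col], as_index=False).sum()
    # the summed counter of 'Other' is at least n, so it sorts after the top rows
    return (summed.sort_values(group_cols + ["g"])
                  .drop(columns="g")
                  .reset_index(drop=True))

sample = pd.DataFrame({
    "Country": ["Europe"] * 12,
    "Region": ["Italy"] * 12,
    "City": ["Milan"] * 5 + ["Rome"] * 7,
    "Product": ["Ring", "Book", "Phone", "Car", "Ring",
                "Pen", "Ring", "Book", "Phone", "Car", "Ring", "Pencil"],
    "Sales": [100, 300, 1500, 200, 100, 8200, 100, 300, 1500, 200, 100, 8100],
    "Val": [10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5],
})
result = top_n_relabel(sample, ["Country", "Region", "City"])
print(result)
```

Compared with the first solution, this needs no concat step, at the cost of a temporary counter column.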