I have the following pandas DataFrame df:
cluster tag amount name
1 0 200 Michael
2 1 1200 John
2 1 900 Daniel
2 0 3000 David
2 0 600 Jonny
3 0 900 Denisse
3 1 900 Mike
3 1 3000 Kely
3 0 2000 Devon
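What I need to do is add another column to df that, for every row, holds the name (from the name column) of the row with the highest amount among the rows in the same cluster where tag is 1. In other words, the result should look like this: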
cluster tag amount name highest_amount
1 0 200 Michael NaN
2 1 1200 John John
2 1 900 Daniel John
2 0 3000 David John
2 0 600 Jonny John
3 0 900 Denisse Kely
3 1 900 Mike Kely
3 1 3000 Kely Kely
3 0 2000 Devon Kely
I tried something like this:
df.groupby('cluster')[['name', 'amount']].transform('max')[df['tag'] == 1]
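But the problem is that the name is not repeated on every row of the cluster; the rows where tag is 0 end up as NaN. It looks like this: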
cluster tag amount name highest_amount
1 0 200 Michael NaN
2 1 1200 John John
2 1 900 Daniel John
2 0 3000 David NaN
2 0 600 Jonny NaN
3 0 900 Denisse NaN
3 1 900 Mike Kely
3 1 3000 Kely Kely
3 0 2000 Devon NaN
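Can someone tell me how to use split-apply-combine with a condition, and have the result repeated on every row?
Answer 0 (score: 1)
You can do this in two stages. First compute a mapping series from cluster to name, then map it onto the cluster column: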
s = df.query('tag == 1')\
.sort_values('amount', ascending=False)\
.drop_duplicates('cluster')\
.set_index('cluster')['name']
df['highest_name'] = df['cluster'].map(s)
print(df)
cluster tag amount name highest_name
0 1 0 200 Michael NaN
1 2 1 1200 John John
2 2 1 900 Daniel John
3 2 0 3000 David John
4 2 0 600 Jonny John
5 3 0 900 Denisse Kely
6 3 1 900 Mike Kely
7 3 1 3000 Kely Kely
8 3 0 2000 Devon Kely
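As a variation on the two-stage idea, here is a minimal, self-contained sketch that builds the sample frame and picks the winning row per cluster with idxmax instead of sorting and dropping duplicates. The intermediate names tagged and idx are only illustrative, and this is one possible way to build the mapping series, not necessarily the preferred one.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'cluster': [1, 2, 2, 2, 2, 3, 3, 3, 3],
    'tag': [0, 1, 1, 0, 0, 0, 1, 1, 0],
    'amount': [200, 1200, 900, 3000, 600, 900, 900, 3000, 2000],
    'name': ['Michael', 'John', 'Daniel', 'David', 'Jonny',
             'Denisse', 'Mike', 'Kely', 'Devon'],
})

# Stage 1: among tag == 1 rows, find the index label of the largest amount per cluster
tagged = df[df['tag'] == 1]
idx = tagged.groupby('cluster')['amount'].idxmax()

# Build the cluster -> name mapping from those rows
s = df.loc[idx].set_index('cluster')['name']

# Stage 2: broadcast the mapping to every row; clusters with no tag == 1 row get NaN
df['highest_name'] = df['cluster'].map(s)
print(df)
```

If two tag == 1 rows in a cluster share the same maximum amount, idxmax keeps the first one it encounters.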
If you want to use groupby, here is one way:
import numpy as np

def func(x):
    # Name with the largest amount among the tag == 1 rows of this cluster
    names = x.query('tag == 1').sort_values('amount', ascending=False)['name']
    return names.iloc[0] if not names.empty else np.nan

df['highest_name'] = df['cluster'].map(df.groupby('cluster').apply(func))
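For completeness, a self-contained sketch of the groupby route on the sample data. The helper name top_tagged_name and the explicit column selection before apply are my own additions, not part of the answer above; the selection is there on the assumption that some newer pandas releases warn when apply operates on the grouping column, so treat that detail as version-dependent.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'cluster': [1, 2, 2, 2, 2, 3, 3, 3, 3],
    'tag': [0, 1, 1, 0, 0, 0, 1, 1, 0],
    'amount': [200, 1200, 900, 3000, 600, 900, 900, 3000, 2000],
    'name': ['Michael', 'John', 'Daniel', 'David', 'Jonny',
             'Denisse', 'Mike', 'Kely', 'Devon'],
})

def top_tagged_name(group):
    # Hypothetical helper: name with the largest amount among tag == 1 rows,
    # or NaN when the cluster has no such row
    names = group.loc[group['tag'] == 1].sort_values('amount', ascending=False)['name']
    return names.iloc[0] if not names.empty else np.nan

# Split by cluster, apply the helper to the non-grouping columns only,
# which yields one value per cluster (a Series indexed by cluster)
per_cluster = df.groupby('cluster')[['tag', 'amount', 'name']].apply(top_tagged_name)

# Combine: repeat each cluster's value on every row of that cluster
df['highest_name'] = df['cluster'].map(per_cluster)
print(df)
```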