In a pandas DataFrame df I have columns like this:
NAME KEYWORD AMOUNT INFO
0 orange fruit 13 from italy
1 potato veggie 7 from germany
2 potato veggie 9 from germany
3 orange fruit 8 from italy
4 potato veggie 6 from germany
Grouping by KEYWORD, I want to sum the AMOUNT values of each group and keep the first value of every other column, so that the result reads:
NAME KEYWORD AMOUNT INFO
0 orange fruit 21 from italy
1 potato veggie 22 from germany
I tried
df.groupby('KEYWORD').sum()
but this "sums" over all columns, i.e. I get
NAME KEYWORD AMOUNT INFO
0 orangeorange fruit 21 from italyfrom italy
1 potatopotatopotato veggie 22 from germanyfrom germanyfrom germany
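Then I tried using different functions for different columns:
df.groupby('KEYWORD').agg({'AMOUNT': sum, 'NAME': first, ....})
with
def first(f_arg, *args):
    return f_arg
but unfortunately this gives me a "ValueError: function does not reduce" error.
So I'm a bit at a loss. How can I apply sum only to the AMOUNT column while keeping the other columns?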
Answer 0: (score: 2)
Use groupby + agg with a custom aggfunc dict.
f = dict.fromkeys(df.columns.difference(['KEYWORD']), 'first')
f['AMOUNT'] = 'sum'
df = df.groupby('KEYWORD', as_index=False).agg(f)
df
KEYWORD NAME AMOUNT INFO
0 fruit orange 21 from italy
1 veggie potato 22 from germany
dict.fromkeys gives me a nice way to generalize this to N columns. If the column order matters, add a reindex operation at the end:
df = df.groupby('KEYWORD', as_index=False).agg(f).reindex(columns=df.columns)
df
NAME KEYWORD AMOUNT INFO
0 orange fruit 21 from italy
1 potato veggie 22 from germany
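For reference, here is the approach above as a minimal self-contained sketch. The sample frame is rebuilt from the question's data, and the names agg_map and result are only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "NAME": ["orange", "potato", "potato", "orange", "potato"],
    "KEYWORD": ["fruit", "veggie", "veggie", "fruit", "veggie"],
    "AMOUNT": [13, 7, 9, 8, 6],
    "INFO": ["from italy", "from germany", "from germany",
             "from italy", "from germany"],
})

# Take 'first' for every non-key column, then override AMOUNT with 'sum'.
agg_map = dict.fromkeys(df.columns.difference(["KEYWORD"]), "first")
agg_map["AMOUNT"] = "sum"

result = (
    df.groupby("KEYWORD", as_index=False)
      .agg(agg_map)
      .reindex(columns=df.columns)  # restore the original column order
)
print(result)
```

The string names 'first' and 'sum' select pandas' built-in groupby reductions, each of which returns one value per group. The custom first() in the question most likely fails because it hands back the whole group instead of a single value.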
Answer 1: (score: 1)
Use drop_duplicates on the KEYWORD column, then assign the aggregated values:
df = df.drop_duplicates('KEYWORD').assign(AMOUNT=df.groupby('KEYWORD', sort=False)['AMOUNT'].sum().values)
print(df)
NAME KEYWORD AMOUNT INFO
0 orange fruit 21 from italy
1 potato veggie 22 from germany
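Note that assigning through .values is positional, so it relies on the group sums coming out in the same order as the rows kept by drop_duplicates. A possible variant, sketched below with the question's data and illustrative names (totals, result), aligns the sums by key using Series.map, so row order does not matter.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "NAME": ["orange", "potato", "potato", "orange", "potato"],
    "KEYWORD": ["fruit", "veggie", "veggie", "fruit", "veggie"],
    "AMOUNT": [13, 7, 9, 8, 6],
    "INFO": ["from italy", "from germany", "from germany",
             "from italy", "from germany"],
})

# Per-group totals, indexed by KEYWORD.
totals = df.groupby("KEYWORD")["AMOUNT"].sum()

# Keep the first row of each KEYWORD, then look up its total by key
# rather than by position.
result = (
    df.drop_duplicates("KEYWORD")
      .assign(AMOUNT=lambda d: d["KEYWORD"].map(totals))
)
print(result)
```

Because the lookup goes through the KEYWORD value, this should give the same result even if the groups first appear in a different order than their sorted keys.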