I have a pandas dataset that looks like this:
city  difference
NY    6
SF    8
LA    8
NY    9
SF    10
I want to sum the values of the difference column for each city, so that my final dataset looks like this:
city  difference  total difference
NY    6           15
NY    9
LA    8           8
SF    10          10
I tried
df['total difference'] = df.groupby('city')['difference'].sum()
but it does not work. I even tried the approach from "How to sum values of particular rows in pandas?", but the values in the new column are all NaN. Please help!
Answer 0 (score: 4)
I think you need transform:
df['total difference'] = df.groupby('city')['difference'].transform(sum)
print (df)
  city  difference  total difference
0   NY           6                15
1   SF           8                18
2   LA           8                 8
3   NY           9                15
4   SF          10                18
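To see why the attempt in the question produced NaN, it may help to compare the two results side by side. The sketch below rebuilds the question's data; the variable names per_city and misaligned are only illustrative. groupby(...).sum() returns one value per city, indexed by city name, so assigning it to a frame indexed 0..4 finds no matching labels. transform returns a result with the same index as the original rows.

```python
import pandas as pd

# The question's data, rebuilt by hand.
df = pd.DataFrame({
    'city': ['NY', 'SF', 'LA', 'NY', 'SF'],
    'difference': [6, 8, 8, 9, 10],
})

# Aggregation: one row per city, indexed by the city name.
per_city = df.groupby('city')['difference'].sum()

# Assigning that Series aligns on index labels (0..4 vs. 'LA', 'NY', 'SF'),
# so no label matches and the new column is entirely NaN.
misaligned = df.assign(**{'total difference': per_city})

# transform broadcasts each group's sum back onto the original rows.
df['total difference'] = df.groupby('city')['difference'].transform('sum')
```

Here misaligned reproduces the all-NaN column from the question, while df carries the per-city totals on every row.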
If you also need to sort by the city column:
df['total difference'] = df.groupby('city')['difference'].transform('sum')
df = df.sort_values('city')
print (df)
  city  difference  total difference
2   LA           8                 8
0   NY           6                15
3   NY           9                15
1   SF           8                18
4   SF          10                18
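The desired output in the question appears to show the total only once per city. If that is really what is wanted (this reading of the question is an assumption), one possible sketch is to keep the total on the first row of each city and blank out the rest. The names totals and first_in_city are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['NY', 'SF', 'LA', 'NY', 'SF'],
    'difference': [6, 8, 8, 9, 10],
})

# Per-city sum, aligned to the original rows.
totals = df.groupby('city')['difference'].transform('sum')

# True for the first occurrence of each city, False for later repeats.
first_in_city = ~df.duplicated('city')

# Keep the total only on the first row of each city; other rows become NaN,
# which also turns the column into floats.
df['total difference'] = totals.where(first_in_city)

# A stable sort keeps the original row order within each city.
df = df.sort_values('city', kind='stable')
```

Because of the NaN values the column is stored as float; whether that matters depends on how the result is used afterwards.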
I was curious whether the way the function is passed makes a difference; the timings are very similar:
import numpy as np
import pandas as pd

# [10000000 rows x 2 columns]
np.random.seed(100)
df = pd.DataFrame(np.random.randint(1000, size=(10000000,2)), columns=['city','difference'])
#print (df)
In [293]: %timeit (df.groupby('city')['difference'].transform('sum'))
1 loop, best of 3: 570 ms per loop
In [294]: %timeit (df.groupby('city')['difference'].transform(sum))
1 loop, best of 3: 567 ms per loop
In [295]: %timeit (df.groupby('city')['difference'].transform(np.sum))
1 loop, best of 3: 561 ms per loop
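As a quick sanity check that the three spellings timed above also agree on the result, here is a small sketch on a reduced version of the same random data. The row count of 100,000 is an arbitrary choice to keep the check fast and is not the size used for the timings. Newer pandas releases may emit a FutureWarning for the callable forms, so the string 'sum' is probably the safer spelling.

```python
import numpy as np
import pandas as pd

np.random.seed(100)
# Far fewer rows than the 10,000,000 timed above, just to compare results.
df = pd.DataFrame(np.random.randint(1000, size=(100_000, 2)),
                  columns=['city', 'difference'])

grouped = df.groupby('city')['difference']

by_name = grouped.transform('sum')      # string name of the aggregation
by_builtin = grouped.transform(sum)     # Python's built-in sum
by_numpy = grouped.transform(np.sum)    # NumPy's sum

# All three should hold the same per-row group totals.
same_values = bool((by_name == by_builtin).all() and (by_name == by_numpy).all())
```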