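Consider this DataFrame, which has many columns: the group is defined in the 'feature' column and the numbers of interest are in the 'values' column.
I need the relative value within each feature (group) in an additional column. The desired result is precomputed by hand in the 'desired' column: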
import pandas as pd

df = pd.DataFrame(
    data={
        'feature': [1, 1, 2, 3, 3, 3],
        'values': [30.0, 20.0, 25.0, 100.0, 250.0, 50.0],
        'desired': [0.6, 0.4, 1.0, 0.25, 0.625, 0.125],
        'more_columns': range(6),
    },
)
which results in this DataFrame:
   feature  values  desired  more_columns
0        1    30.0    0.600             0
1        1    20.0    0.400             1
2        2    25.0    1.000             2
3        3   100.0    0.250             3
4        3   250.0    0.625             4
5        3    50.0    0.125             5
So for the group defined by feature 1, the desired values are 0.6 and 0.4 (because 0.6 = 30 / (20 + 30)), and so on.
I found these values manually using:
for feature, group in df.groupby('feature'):
    rel_values = (group['values'] / group['values'].sum()).values
    df[df['feature'] == feature]['result'] = rel_values  # no effect
    print(f'{feature}: {rel_values}')

# which prints:
1: [0.6 0.4]
2: [1.]
3: [0.25 0.625 0.125]
# but df remains unchanged
I believe pandas must have a smart and fast way to achieve this.
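A note on why the loop leaves df unchanged: `df[mask]['result'] = ...` first builds a new object with `df[mask]` and then assigns into that temporary, so the original frame is never touched. Below is a minimal sketch of the same loop writing through `.loc` instead. Pre-creating the `result` column is an assumption made here to keep the assignment unambiguous; it is not part of the original code.

```python
import pandas as pd

df = pd.DataFrame({
    'feature': [1, 1, 2, 3, 3, 3],
    'values': [30.0, 20.0, 25.0, 100.0, 250.0, 50.0],
})

# Assumption: create the target column up front, filled with NaN.
df['result'] = float('nan')

for feature, group in df.groupby('feature'):
    rel_values = (group['values'] / group['values'].sum()).values
    # .loc with a row mask and a column label writes into df itself,
    # not into a temporary copy.
    df.loc[df['feature'] == feature, 'result'] = rel_values

print(df)
```

This still loops over the groups in Python, so it is mainly useful for understanding the problem rather than as the fast solution being asked for.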
Answer 0: (score: 4)
Use GroupBy.transform to return a Series filled with the per-group sum. It has the same size as the original df, so you can divide by it with div:
df['new'] = df['values'].div(df.groupby('feature')['values'].transform('sum'))
print (df)
   feature  values  desired  more_columns    new
0        1    30.0    0.600             0  0.600
1        1    20.0    0.400             1  0.400
2        2    25.0    1.000             2  1.000
3        3   100.0    0.250             3  0.250
4        3   250.0    0.625             4  0.625
5        3    50.0    0.125             5  0.125
Details:
print (df.groupby('feature')['values'].transform('sum'))
0     50.0
1     50.0
2     25.0
3    400.0
4    400.0
5    400.0
Name: values, dtype: float64
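To make the broadcasting step more concrete, here is a small sketch that produces the same per-row sums without transform: it aggregates one sum per group and then maps those sums back onto the rows via the 'feature' column. The names `group_sums`, `per_row_sums` and `relative` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'feature': [1, 1, 2, 3, 3, 3],
    'values': [30.0, 20.0, 25.0, 100.0, 250.0, 50.0],
})

# One sum per group, indexed by the group key (1, 2, 3).
group_sums = df.groupby('feature')['values'].sum()

# Look up each row's group sum by its 'feature' value.
# This yields the same numbers as transform('sum').
per_row_sums = df['feature'].map(group_sums)

# Element-wise division gives the relative value within each group.
relative = df['values'] / per_row_sums
print(relative)
```

transform('sum') does the aggregate and the broadcast back to the original index in one call, which is why its result can be used directly as a divisor.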
Performance:
On real data this depends on the number of groups and the length of the DataFrame.
np.random.seed(123)
N = 1000000
L = np.random.randint(1000, size=N)
df = pd.DataFrame({'feature': np.random.choice(L, N),
                   'values': np.random.rand(N)})
#print (df)
In [272]: %timeit df['new'] = df['values'].div(df.groupby('feature')['values'].transform('sum'))
80.7 ms ± 2.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [273]: %timeit df['desired'] = df.groupby('feature').apply(lambda g: g['values'] / g['values'].sum()).values
1.17 s ± 23.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [274]: %timeit df['desired'] = df.groupby('feature')['values'].transform(lambda x: x / x.sum())
727 ms ± 14.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
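The timings only measure speed, so it can be worth confirming that the fast and the slow transform variants agree on data that is not sorted by group. The sketch below does that on a smaller random frame. The size, the number of groups and the use of `np.random.default_rng` are assumptions chosen for a quick check; they do not reproduce the benchmark setup.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(123)
N = 10_000  # much smaller than the benchmark, just for a quick check
df = pd.DataFrame({
    'feature': rng.integers(0, 100, size=N),  # unsorted group labels
    'values': rng.random(N),
})

# Vectorised: built-in 'sum' aggregation broadcast back to every row.
fast = df['values'].div(df.groupby('feature')['values'].transform('sum'))

# Slower: a Python lambda is called once per group.
slow = df.groupby('feature')['values'].transform(lambda x: x / x.sum())

# Both variants should agree, and each group's shares should sum to 1.
same = np.allclose(fast, slow)
group_totals = fast.groupby(df['feature']).sum()
print(same, np.allclose(group_totals, 1.0))
```

Passing the string 'sum' lets pandas use its built-in aggregation, whereas the lambda runs Python code per group, which is consistent with the gap in the timings.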
Answer 1: (score: 3)
Method 1: using transform
df['desired'] = df.groupby('feature')['values'].transform(lambda x: x / x.sum())
Method 2: using apply
df['desired'] = df.groupby('feature').apply(lambda g: g['values'] / g['values'].sum()).values
Output:
   feature  values  desired  more_columns
0        1    30.0    0.600             0
1        1    20.0    0.400             1
2        2    25.0    1.000             2
3        3   100.0    0.250             3
4        3   250.0    0.625             4
5        3    50.0    0.125             5
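One caveat with Method 2: `.values` drops the index, so the assignment is purely positional and is only safe if the rows come back in the same order as df. That holds in the example because df is already sorted by 'feature', but it may not hold for unsorted data, and the row order returned by apply has varied across pandas versions. A minimal sketch that relies on index alignment instead, assuming a reasonably recent pandas where `group_keys=False` keeps the original index on the result:

```python
import pandas as pd

# Rows deliberately not sorted by 'feature'.
df = pd.DataFrame({
    'feature': [3, 1, 3, 2, 1, 3],
    'values': [100.0, 30.0, 250.0, 25.0, 20.0, 50.0],
})

# Select the column first, and keep the original row labels on the result.
rel = df.groupby('feature', group_keys=False)['values'].apply(
    lambda s: s / s.sum()
)

# Assigning a Series aligns on the index, so row order does not matter.
df['desired'] = rel
print(df)
```

Method 1 (transform) already returns a result aligned to the original index, so it does not need this extra care.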