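I'm trying to apply groupby -> mean to the first n-1 rows of each group and then assign that mean to the group's nth (last) row in pandas. Below are my current code and the desired output. It takes a long time to run, and I'm wondering whether anyone knows how to optimize it.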
import pandas as pd
import numpy as np
df = pd.DataFrame({'id': ['A', 'A', 'A', 'B', 'B', 'C'],
                   'vals': [2, 3, 4, 5, 6, 7]})
# current solution
for h in df['id'].unique():
    h_df = df[df['id'] == h]
    indices = h_df.index
    size = h_df.shape[0]
    last_index = indices[size-1]
    if size == 1:
        df.iloc[last_index, df.columns.get_loc('vals')] = np.nan
        continue
    exclude_last = h_df[:size-1]
    avg = (exclude_last.groupby('id')['vals'].mean()).values[0]
    df.iloc[last_index, df.columns.get_loc('vals')] = avg
# output
# id  vals
# A   2
# A   3
# A   2.5     => (2+3) / 2
# B   5
# B   5       => (5/1)
# C   np.nan
Answer 0 (score: 0)
There is no reason to loop over the unique values, select each group, and then run another groupby. .groupby can do all of this by itself:
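In [1]: def mean_head(group):
   ...:     # Float copy: the mean (or NaN) fits, and the write is not a chained assignment
   ...:     group = group.astype({"vals": float})
   ...:     group.iloc[-1, group.columns.get_loc("vals")] = group["vals"].iloc[:-1].mean()
   ...:     return group
   ...: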
In [2]: df.groupby("id").apply(mean_head)
Out[2]:
  id  vals
0  A   2.0
1  A   3.0
2  A   2.5
3  B   5.0
4  B   5.0
5  C   NaN
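If speed is still a concern, the same result can probably be had without calling a Python function once per group. The following is only a sketch and not part of the answer above: it assumes `vals` contains no missing values and that the "last" row of a group is its last occurrence in the frame. The names `g`, `is_last`, `mean_rest`, and `result` are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': ['A', 'A', 'A', 'B', 'B', 'C'],
                   'vals': [2, 3, 4, 5, 6, 7]})

g = df.groupby('id')['vals']
total = g.transform('sum')    # per-group sum, broadcast back to every row
count = g.transform('count')  # per-group row count, broadcast back to every row

# True on the last occurrence of each id (assumed to be the "nth" row)
is_last = ~df['id'].duplicated(keep='last')

# Mean of the other rows in the group: (sum - own value) / (count - 1).
# Groups with a single row have no other rows, so they become NaN.
mean_rest = ((total - df['vals']) / (count - 1)).where(count > 1)

# Overwrite only the last row of each group; cast to float so NaN and means fit
result = df.assign(vals=df['vals'].astype(float).mask(is_last, mean_rest))
print(result)
```

This relies only on `transform`, `duplicated`, and `mask`, so the work stays in vectorized pandas operations; whether it is actually faster on your data is worth timing.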