Consider a DataFrame like this:
import numpy as np
import pandas as pd

size = 10
d = {
    'id': np.random.randint(1, 10, size),
    'value': np.random.randint(10, 100, size)
}
df = pd.DataFrame(data=d)
# For each row, count the rows so far (including this one) that share its id
df['others_count'] = df.groupby(['id']).cumcount() + 1
which produces the following:
   id  value  others_count
0   3     76             1
1   4     12             1
2   1     96             1
3   6     33             1
4   4     49             2
5   8     72             1
6   8     68             2
7   7     78             1
8   9     99             1
9   1     66             2
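Because the frame above is built from random numbers, the table will differ on every run. The following is a minimal sketch that hard-codes the values printed above (copied from the table, not from a known seed) so the rest of the thread can be reproduced exactly:

```python
import pandas as pd

# Values copied by hand from the printed table; no random seed is implied
df = pd.DataFrame({
    'id':    [3, 4, 1, 6, 4, 8, 8, 7, 9, 1],
    'value': [76, 12, 96, 33, 49, 72, 68, 78, 99, 66],
})

# cumcount() numbers the rows of each group from 0, so +1 gives a
# 1-based position of the row among the rows sharing its id
df['others_count'] = df.groupby('id').cumcount() + 1
print(df)
```

A row with others_count greater than 1 is one that has at least one earlier row with the same id.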
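For rows that share their id with at least one earlier row (rows 4, 6 and 9 in my example), I need to add another column containing the mean of the value column over all the rows above that have the same id.
The solution I came up with is inefficient, and I suspect it is also flawed in some way: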
for row in range(0, df.shape[0]):
    if df['id'][row] > 1:
        address = df['id'][row]
        others = df['others_count'][row]
        df.loc[row, 'value_estimated'] = df.loc[(df['id']==address)&(df['others_count']<others), 'value'].mean()
which outputs the following:
   id  value  others_count  value_estimated
0   3     76             1              NaN
1   4     12             1              NaN
2   1     96             1              NaN
3   6     33             1              NaN
4   4     49             2             12.0
5   8     72             1              NaN
6   8     68             2             72.0
7   7     78             1              NaN
8   9     99             1              NaN
9   1     66             2              NaN
Rows 4 and 6 are correct, but the last row should have 96 as its value_estimated.
Do you have a better solution for this?
Answer 0 (score: 2)
IIUC, you can use groupby on id with expanding and mean(), followed by shift to move the values down by 1:
df['value_estimated'] = df.groupby('id')['value'].apply(
    lambda x: x.expanding().mean().shift())
print(df)
   id  value  others_count  value_estimated
0   3     76             1              NaN
1   4     12             1              NaN
2   1     96             1              NaN
3   6     33             1              NaN
4   4     49             2             12.0
5   8     72             1              NaN
6   8     68             2             72.0
7   7     78             1              NaN
8   9     99             1              NaN
9   1     66             2             96.0
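On recent pandas versions, apply on a groupby may prepend the group keys to the result index, in which case the assignment back to df can fail or misalign. As a hedged alternative (a variant of the same idea, not the code from the answer), transform keeps the result aligned to the original index. The sketch below uses a hard-coded copy of the example data so the output is reproducible:

```python
import pandas as pd

# Values copied by hand from the example table in the question
df = pd.DataFrame({
    'id':    [3, 4, 1, 6, 4, 8, 8, 7, 9, 1],
    'value': [76, 12, 96, 33, 49, 72, 68, 78, 99, 66],
})

# Within each id group: running mean up to and including each row,
# then shifted down one row so that a row only sees the rows above it.
# transform returns a Series indexed like df, so it can be assigned directly.
df['value_estimated'] = (
    df.groupby('id')['value']
      .transform(lambda s: s.expanding().mean().shift())
)
print(df)
```

As for the loop in the question, its condition tests df['id'][row] > 1 rather than df['others_count'][row] > 1, so every row whose id is 1 is skipped. That is why row 9 came out as NaN instead of 96.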