Question

Consider a DataFrame like this:

import numpy as np
import pandas as pd

size = 10
d = {
    'id': np.random.randint(1, 10, size),
    'value': np.random.randint(10, 100, size)
}
df = pd.DataFrame(data=d)

# For each row, count the rows seen so far with the same id (including the row itself)
df['others_count'] = df.groupby(['id']).cumcount() + 1

which produces the following:

   id  value  others_count
0   3     76             1
1   4     12             1
2   1     96             1
3   6     33             1
4   4     49             2
5   8     72             1
6   8     68             2
7   7     78             1
8   9     99             1
9   1     66             2

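For rows that share an id with at least one earlier row (rows 4, 6 and 9 in my example), I need to add another column containing the mean of the value column over all rows above that belong to the same id.

The solution I came up with is inefficient, and I suspect it is also flawed: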

for row in range(0, df.shape[0]):
    if df['id'][row] > 1:
        address = df['id'][row]
        others = df['others_count'][row]
        df.loc[row, 'value_estimated'] = df.loc[(df['id']==address)&(df['others_count']<others), 'value'].mean()

which outputs the following:

   id  value  others_count  value_estimated
0   3     76             1              NaN
1   4     12             1              NaN
2   1     96             1              NaN
3   6     33             1              NaN
4   4     49             2             12.0
5   8     72             1              NaN
6   8     68             2             72.0
7   7     78             1              NaN
8   9     99             1              NaN
9   1     66             2              NaN

Rows 4 and 6 are correct, but value_estimated for the last row should be 96.

Do you have a better solution for this?

Answer 1

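IIUC, you can group by id, take the expanding mean() of value within each group, and then shift the result down by one row so that each row only sees the rows above it: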

# transform keeps the original row index, so the result aligns with df
df['value_estimated'] = df.groupby('id')['value'].transform(
    lambda x: x.expanding().mean().shift())
print(df)

   id  value  others_count  value_estimated
0   3     76             1              NaN
1   4     12             1              NaN
2   1     96             1              NaN
3   6     33             1              NaN
4   4     49             2             12.0
5   8     72             1              NaN
6   8     68             2             72.0
7   7     78             1              NaN
8   9     99             1              NaN
9   1     66             2             96.0
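
As a quick check, here is a minimal, self-contained sketch that hard-codes the sample values from the question (rather than drawing random ones) and reproduces the table above. It also shows one possible alternative that avoids the lambda by dividing the sum of earlier values by the number of earlier rows; the names `prev_sum` and `prev_count` are made up for this illustration and are not part of the original answer. Incidentally, the loop in the question appears to miss row 9 because its condition tests `df['id'][row] > 1` rather than `others_count`, and that row has id 1.

```python
import numpy as np
import pandas as pd

# Fixed copy of the sample data from the question (instead of random values)
df = pd.DataFrame({
    'id':    [3, 4, 1, 6, 4, 8, 8, 7, 9, 1],
    'value': [76, 12, 96, 33, 49, 72, 68, 78, 99, 66],
})

g = df.groupby('id')['value']

# Approach from the answer: running mean per id, shifted down by one row
expanding_mean = g.transform(lambda x: x.expanding().mean().shift())

# Alternative sketch (an assumption, not from the answer):
# sum of the earlier values in the group / number of earlier rows in the group
prev_sum = g.cumsum() - df['value']
prev_count = g.cumcount()
# where() turns a count of 0 into NaN, so first occurrences stay NaN
df['value_estimated'] = prev_sum / prev_count.where(prev_count > 0)

print(df)
```

Both variants should give NaN for the first occurrence of each id and 12.0, 72.0 and 96.0 for rows 4, 6 and 9. The cumsum-based version relies only on built-in groupby operations, which may be faster on large frames, though that is worth measuring on your own data.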
