Question

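The for loop below runs very slowly, but it captures the gist of what I want to do. For each value of the variable 'category', I want to compute the running mean of the 'y_all_reg' column over all rows before the current row (excluding the current row). In the code below, the computed value is called 'encoded'.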

How can I vectorize this operation in pandas?

df['encoded'] = 0 # df is already sorted by 'datetime'
categories = df['category'].unique()
for r in categories:
    subdf = df.loc[df.category == r, 'y_all_reg']
    df.loc[df.category == r, 'encoded'] = \
            subdf.expanding().mean() - subdf / subdf.expanding().count()

Answer 1

IIUC, you need expanding().mean() combined with shift():

df['encoded'] = df.groupby('category')['y_all_reg'].transform(lambda x: x.expanding().mean().shift())
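
To make the effect of shift() easier to see, here is a minimal sketch on a hand-checkable frame. The toy values and the helper name prior_mean are made up for illustration; only the column names come from the question.

```python
import pandas as pd

# Hypothetical toy data, assumed to be in time order already.
df = pd.DataFrame({
    "category": ["a", "a", "b", "a", "b"],
    "y_all_reg": [1.0, 3.0, 10.0, 5.0, 20.0],
})

def prior_mean(x):
    # Expanding mean that includes the current row, then shifted down by one
    # so each row only sees the rows before it within its own group.
    return x.expanding().mean().shift()

df["encoded"] = df.groupby("category")["y_all_reg"].transform(prior_mean)
print(df)
```

The first row of each category has no earlier rows, so it gets NaN; every other row gets the mean of the earlier values in its category.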

Option 2: you can also apply expanding().mean() and shift() as separate steps:

g = df.groupby('category')
df['encoded'] = g['y_all_reg'].expanding().mean().reset_index(level=0, drop=True)
df['encoded'] = g['encoded'].shift()

Option 3: with a larger dataset and a higher category count, you can compute the running mean manually:

g = df.groupby('category')
s = g['y_all_reg'].agg(['cumsum','cumcount'])
df['encoded'] = s['cumsum']/s['cumcount'].add(1)
df['encoded'] = g['encoded'].shift()
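
A variant of the manual approach, not the code from this answer, calls cumsum() and cumcount() directly and subtracts the current row so that no shift() is needed. This is only a sketch: the toy data is hypothetical, and it assumes 'y_all_reg' contains no missing values (cumcount() counts rows, not non-null values).

```python
import pandas as pd

# Hypothetical toy data, assumed to be in time order already.
df = pd.DataFrame({
    "category": ["a", "a", "b", "a", "b"],
    "y_all_reg": [1.0, 3.0, 10.0, 5.0, 20.0],
})

g = df.groupby("category")["y_all_reg"]
prior_sum = g.cumsum() - df["y_all_reg"]  # sum of the earlier rows in the group
prior_n = g.cumcount()                    # number of earlier rows in the group

# Rows with no earlier row get NaN instead of a division by zero.
df["encoded"] = prior_sum / prior_n.where(prior_n > 0)
print(df)
```

Because the sum is formed by subtraction, results may differ from expanding().mean() in the last floating-point digits.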

Data:

import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame({'category': np.random.randint(0,2,10),
                   'y_all_reg': np.random.uniform(0,1,10)})

Output:

   category  y_all_reg      encoded
0         1  0.092339           NaN
1         1  0.186260      0.092339
2         0  0.345561           NaN
3         0  0.396767      0.345561
4         1  0.538817      0.139299
5         1  0.419195      0.272472
6         1  0.685220      0.309153
7         1  0.204452      0.384366
8         1  0.878117      0.354380
9         0  0.027388      0.371164

Performance: tested on 10 rows of 10000 categories:

Option 1: 7.81 ms ± 98.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)    
Option 2: 8.13 ms ± 248 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Option 3: 5.96 ms ± 261 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
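
The benchmark frame itself is not shown, so the following is only a sketch of how such a comparison could be rerun. The frame sizes, the seed, and the function names option1 and option3 are placeholders, and option3 uses explicit cumsum()/cumcount() calls rather than the agg() form above.

```python
import timeit

import numpy as np
import pandas as pd

# Placeholder benchmark shape (an assumption): 10,000 rows over 10 categories.
rng = np.random.default_rng(0)
bench = pd.DataFrame({
    "category": rng.integers(0, 10, 10_000),
    "y_all_reg": rng.uniform(0, 1, 10_000),
})

def option1(frame):
    # expanding().mean() followed by shift(), applied per group.
    return frame.groupby("category")["y_all_reg"].transform(
        lambda x: x.expanding().mean().shift()
    )

def option3(frame):
    # Manual running mean: cumulative sum over cumulative count, then shift.
    g = frame.groupby("category")["y_all_reg"]
    mean_incl = g.cumsum() / (g.cumcount() + 1)
    return mean_incl.groupby(frame["category"]).shift()

# Both approaches should agree up to floating-point error.
same = np.allclose(option1(bench), option3(bench), equal_nan=True)
print("results match:", same)

for name, fn in [("option1", option1), ("option3", option3)]:
    seconds = timeit.timeit(lambda: fn(bench), number=5) / 5
    print(f"{name}: {seconds * 1e3:.2f} ms per call")
```

Absolute timings depend on the machine and the pandas version, so only the relative ordering is worth comparing.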
