Question

This is a code snippet that simulates the problem I am facing. I am iterating row by row over a large dataset.

import numpy as np
import pandas as pd

df = pd.DataFrame({'grp':np.random.choice([1,2,3,4,5],500),'col1':np.arange(0,500),'col2':np.random.randint(0,10,500),'col3':np.nan})

for index, row in df.iterrows():
    # based on the group label, take the last 3 values before the current row and average them
    d = df.iloc[0:index].groupby('grp')
    try:
        dgrp_sum = d.get_group(row.grp).col2.tail(3).mean()
    except KeyError:
        # no earlier row belongs to this group yet
        dgrp_sum = 999
    # multiply that mean by the current row's col1 and col2
    df.at[index, 'col3'] = dgrp_sum * row.col1 * row.col2

How can I convert this code to a vectorized form to speed it up?

Answer 1

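Basically, you are computing a moving average within each group. That means you can group the DataFrame by "grp" and compute a rolling mean. Finally, you multiply the columns in each row, since that step does not depend on the group.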

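# Rolling mean of col2 within each group (window of 3, current row included)
df["col3"] = df.groupby("grp").col2.rolling(3, min_periods=1).mean().reset_index(0, drop=True)
# Multiply col1, col2 and the rolling mean row by row
df["col3"] = df[["col1", "col2", "col3"]].product(axis=1)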

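Note: in your code, each computed mean is placed on the next row of the group, which is probably why you have the try block. In the output below, the `iter` column holds the mean from your loop and the `mask` column holds the rolling mean from the code above, so `mask` runs one row ahead of `iter` within the group.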

# Skipping last product gives only mean
# np.random.seed(1234)
# print(df[df["grp"] == 2])
     grp  col1  col2        iter      mask
4      2     4     6  999.000000  6.000000
5      2     5     0    6.000000  3.000000
6      2     6     9    3.000000  5.000000
17     2    17     1    5.000000  3.333333
27     2    27     9    3.333333  6.333333
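If the goal is to reproduce the loop exactly, including the one-row lag and the 999 fallback, one possible approach is to shift `col2` within each group before taking the rolling mean. The following is only a sketch under some assumptions: the names `mean_of_previous_three` and `prev_mean` are made up for illustration, the seed is arbitrary, and filling with 999 relies on `col2` containing no missing values of its own.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
n = 500
df = pd.DataFrame({
    "grp": rng.choice([1, 2, 3, 4, 5], n),
    "col1": np.arange(n),
    "col2": rng.integers(0, 10, n),
})

def mean_of_previous_three(s):
    # Hypothetical helper: shift() removes the current row from the window,
    # so the rolling mean only sees up to 3 earlier rows of the same group.
    return s.shift().rolling(3, min_periods=1).mean()

# transform() returns a Series aligned with the original row order
prev_mean = df.groupby("grp")["col2"].transform(mean_of_previous_three)

# The first row of each group has no earlier rows; the loop used 999 there
prev_mean = prev_mean.fillna(999)

df["col3"] = prev_mean * df["col1"] * df["col2"]
```

With this variant, `prev_mean` should correspond to the `iter` column in the output above rather than to `mask`.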
