Replace a slow pandas loop with a vectorized function

Time: 2020-10-09 21:33:08

Tags: python pandas performance

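My pandas loop is very slow (more than ten minutes). I am trying to replace it with a vectorized function, but I can't work out what to use. There are multiple records that have different household numbers but the same relationship group number. If a record's household number equals its relationship group number, I want to use that record's officer number and officer name for every record in that relationship group (including records whose household number is different). See the code below: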

        import numpy as np  # pd.np was deprecated in pandas 1.0 and later removed

        rg['RG Officer Number'] = np.nan
        rg['RG Officer Name'] = np.nan
        for index, row in rg.iterrows():
            if row['Relationship Group'] == row['Household Number']:
                mask = rg['Relationship Group'] == row['Relationship Group']
                rg.loc[mask, 'RG Officer Number'] = row['Household Primary Officer Number']
                rg.loc[mask, 'RG Officer Name'] = row['Household Primary Officer Name'] 

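I tried the following, but got an error (cannot use a single bool to index into setitem). I think I am completely off track. Maybe this isn't possible with a vectorized function, but it seems like it should be.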

        mask = row['Relationship Group'] == row['Household Number']
        rg.loc[mask, 'RG Officer Number'] = rg.loc['Household Primary Officer Number']

Any help you can offer would be greatly appreciated.

1 Answer:

Answer 0: (score: 1)

A filter followed by a merge will work.

import numpy as np
import pandas as pd

df = pd.DataFrame({'Household Number':[str(i) for i in range(10)],
                   'Relationship Number':[str(i) for i in range(5)]*2,
                   'RG Officer Number':np.random.randint(1,100,10),
                   'RG Officer Name':['name'+str(i) for i in np.random.randint(1,100,10)]})

df
#  Household Number Relationship Number  RG Officer Number RG Officer Name
#0                0                   0                 28          name87
#1                1                   1                 18          name71
#2                2                   2                 69           name8
#3                3                   3                 83          name64
#4                4                   4                 88          name36
#5                5                   0                 25          name89
#6                6                   1                 51          name76
#7                7                   2                 29          name80
#8                8                   3                 61          name27
#9                9                   4                  2          name95


df_filtered = df.loc[df['Household Number'] == df['Relationship Number']]
df_filtered
#  Household Number Relationship Number  RG Officer Number RG Officer Name
#0                0                   0                 28          name87
#1                1                   1                 18          name71
#2                2                   2                 69           name8
#3                3                   3                 83          name64
#4                4                   4                 88          name36
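The filter works because comparing two whole columns produces a boolean Series with one entry per row, which `.loc` accepts as a row mask. Comparing two fields of a single `row`, as in the question's attempt, yields one scalar boolean instead, and that is most likely what triggers the setitem error. A minimal sketch of the difference, using made-up data (the `toy` frame and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data; the column names follow the answer's example.
toy = pd.DataFrame({'Household Number': ['0', '1', '2', '3'],
                    'Relationship Number': ['0', '1', '0', '1']})

# Column-wise comparison: a boolean Series aligned with the rows.
mask = toy['Household Number'] == toy['Relationship Number']

# Row-wise comparison: a single scalar boolean, not usable as a row mask.
first_row = toy.iloc[0]
scalar = first_row['Household Number'] == first_row['Relationship Number']

# The Series mask selects only the rows where the two columns agree.
filtered = toy.loc[mask]
```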

df_merged = pd.merge(left=df,right=df_filtered[['Relationship Number','RG Officer Number','RG Officer Name']],
                     how='left',
                     on='Relationship Number',suffixes=('_old','_new'))

Here is the merged data:

df_merged
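The same filter-and-merge idea might be applied to the column names from the question roughly as follows. This is only a sketch: the sample values are invented, and it assumes `rg` does not already contain the `RG Officer Number` and `RG Officer Name` columns (otherwise the merge would add suffixes). Where a relationship group has more than one matching record, `keep='last'` is used so that the last one wins, as it would in the original loop.

```python
import pandas as pd

# Hypothetical sample data using the question's column names.
rg = pd.DataFrame({
    'Household Number': ['A', 'B', 'C', 'D', 'E'],
    'Relationship Group': ['A', 'A', 'C', 'C', 'X'],
    'Household Primary Officer Number': [11, 12, 13, 14, 15],
    'Household Primary Officer Name': ['Ann', 'Bob', 'Cid', 'Dee', 'Eve'],
})

# Filter: keep the records whose household number equals their relationship group.
lead = rg.loc[rg['Household Number'] == rg['Relationship Group'],
              ['Relationship Group',
               'Household Primary Officer Number',
               'Household Primary Officer Name']]

# Keep one record per group so the merge cannot duplicate rows.
lead = lead.drop_duplicates('Relationship Group', keep='last')

# Rename the officer columns to the target names before merging.
lead = lead.rename(columns={
    'Household Primary Officer Number': 'RG Officer Number',
    'Household Primary Officer Name': 'RG Officer Name',
})

# Merge: every record picks up the officer of its relationship group.
result = rg.merge(lead, how='left', on='Relationship Group')
```

Groups with no matching record (group `X` here) end up with missing values in the two new columns, which matches the loop's behavior of leaving them as NaN.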