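My pandas loop is very slow (more than ten minutes). I am trying to replace it with a vectorized function, but I can't figure out what to use. There are multiple records that have different household numbers but the same relationship group number. If a record's household number equals its relationship group number, I want to use that record's officer number and officer name for all records in that relationship group (including those with a different household number). See the code below: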
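import numpy as np
import pandas as pd

# pd.np has been removed from recent pandas versions; use numpy directly
rg['RG Officer Number'] = np.nan
rg['RG Officer Name'] = np.nan
for index, row in rg.iterrows():
    # this row is the record whose household number matches its group number
    if row['Relationship Group'] == row['Household Number']:
        # select every record in the same relationship group
        mask = rg['Relationship Group'] == row['Relationship Group']
        rg.loc[mask, 'RG Officer Number'] = row['Household Primary Officer Number']
        rg.loc[mask, 'RG Officer Name'] = row['Household Primary Officer Name']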
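I tried the following, but got an error (cannot use a single bool to index into setitem). I think I am completely off track. Maybe this is not possible with a vectorized function, but it seems like it should be.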
mask = row['Relationship Group'] == row['Household Number']
rg.loc[mask, 'RG Officer Number'] = rg.loc['Household Primary Officer Number']
Any help you can provide would be greatly appreciated.
Answer 0 (score: 1)
A filter followed by a merge will work.
import numpy as np
import pandas as pd

# sample data: 10 households spread over 5 relationship groups
df = pd.DataFrame({'Household Number': [str(i) for i in range(10)],
                   'Relationship Number': [str(i) for i in range(5)] * 2,
                   'RG Officer Number': np.random.randint(1, 100, 10),
                   'RG Officer Name': ['name' + str(i) for i in np.random.randint(1, 100, 10)]})
df
#   Household Number  Relationship Number  RG Officer Number  RG Officer Name
# 0                0                    0                 28           name87
# 1                1                    1                 18           name71
# 2                2                    2                 69            name8
# 3                3                    3                 83           name64
# 4                4                    4                 88           name36
# 5                5                    0                 25           name89
# 6                6                    1                 51           name76
# 7                7                    2                 29           name80
# 8                8                    3                 61           name27
# 9                9                    4                  2           name95
df_filtered = df.loc[df['Household Number'] == df['Relationship Number']]
df_filtered
#   Household Number  Relationship Number  RG Officer Number  RG Officer Name
# 0                0                    0                 28           name87
# 1                1                    1                 18           name71
# 2                2                    2                 69            name8
# 3                3                    3                 83           name64
# 4                4                    4                 88           name36
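# left-merge the filtered rows back onto every record of the same group;
# the '_new' columns hold the values taken from the matching record
df_merged = pd.merge(left=df,
                     right=df_filtered[['Relationship Number', 'RG Officer Number', 'RG Officer Name']],
                     how='left',
                     on='Relationship Number',
                     suffixes=('_old', '_new'))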
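Applied to the column names from the question, the same filter-and-merge idea might look like the sketch below. The sample data is made up for illustration, and the names `leaders` and `result` are hypothetical. It also assumes that at most one record per relationship group has a household number equal to the group number; if there are several, the last one is kept, which mimics the overwrite behaviour of the original loop.

```python
import pandas as pd

# Hypothetical sample data using the column names from the question.
rg = pd.DataFrame({
    'Household Number': ['A', 'B', 'C', 'D', 'E'],
    'Relationship Group': ['A', 'A', 'C', 'C', 'X'],
    'Household Primary Officer Number': [10, 20, 30, 40, 50],
    'Household Primary Officer Name': ['Ann', 'Bob', 'Cid', 'Dee', 'Eve'],
})

# Filter: records whose household number equals their relationship group number.
leaders = rg.loc[
    rg['Household Number'] == rg['Relationship Group'],
    ['Relationship Group',
     'Household Primary Officer Number',
     'Household Primary Officer Name'],
].rename(columns={
    'Household Primary Officer Number': 'RG Officer Number',
    'Household Primary Officer Name': 'RG Officer Name',
})

# Assumption: one such record per group; keep the last one if there are more.
leaders = leaders.drop_duplicates('Relationship Group', keep='last')

# Merge: broadcast the officer number and name to every record in the group.
# Groups without a matching record (here 'X') end up with NaN.
result = rg.merge(leaders, how='left', on='Relationship Group')
print(result)
```

Note that `merge` returns a new frame with a fresh 0..n-1 index, and that the numeric column becomes float when some groups have no match, because missing values are stored as NaN.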