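I have a large list of names that I am trying to deduplicate. I group the rows by name and consolidate their information where needed.
When two people do not share a name there is no problem: we can simply forward-fill and back-fill the missing values. If two people do share a name, however, we need some extra checks.
Here is an example of one group: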
          name       code         id          country  yob
    1137  Bobby Joe  USA19921111  NaN         NaN      NaN
    2367  Bobby Joe  NaN          1223133121  USA      1992
    4398  Bobby Joe  USA19981111  NaN         NaN      NaN
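The code contains the person's country and date of birth. Looking at it, we can see that the first and second rows are the same person, so we need to fill the first row with the information from the second row: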
          name       code         id          country  yob
    1137  Bobby Joe  USA19921111  1223133121  USA      1992
    4398  Bobby Joe  USA19981111  NaN         NaN      NaN
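For reference, here is a minimal sketch that rebuilds the sample group and checks the matching rule described above: a row without a code belongs to a coded row when its country plus year of birth appears inside that code. The dtypes and the variable names (`group`, `donor`, `key`, `matches`) are assumptions made for illustration, not part of the real data.

```python
import numpy as np
import pandas as pd

# Sample group from the example; dtypes are an assumption
# (id and yob end up as float because the columns contain NaN).
group = pd.DataFrame(
    {
        "name": ["Bobby Joe"] * 3,
        "code": ["USA19921111", np.nan, "USA19981111"],
        "id": [np.nan, 1223133121, np.nan],
        "country": [np.nan, "USA", np.nan],
        "yob": [np.nan, 1992, np.nan],
    },
    index=[1137, 2367, 4398],
)

# Build the lookup key from the row that has no code.
donor = group.loc[2367]
key = donor["country"] + str(int(donor["yob"]))

# Plain substring test against every code; rows without a code count as False.
matches = group["code"].str.contains(key, regex=False, na=False)

print(key)                              # USA1992
print(matches[matches].index.tolist())  # [1137]
```

Only row 1137 matches, which is why the information from row 2367 should end up there.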
Here is what I have:
    import pandas as pd

    def consolidate(group):  # hypothetical name; called once per name group
        # Map the index of every row that has a code to its code
        code_rows = group['code'].dropna().to_dict()
        # Copy so rows can be dropped from the temp table safely
        no_code_rows = group.loc[pd.isnull(group['code']), :].copy()
        if no_code_rows.empty or len(code_rows) == group.shape[0]:
            # No info to consolidate
            return group
        for group_idx, code in code_rows.items():
            for row_idx, row in no_code_rows.iterrows():
                country_yob = row['country'] + str(int(row['yob']))
                if country_yob in code:
                    group.loc[group_idx, 'id'] = row['id']
                    group.loc[group_idx, 'country'] = row['country']
                    group.loc[group_idx, 'yob'] = row['yob']
                    group.drop(row_idx, inplace=True)
                    # Drop from temp table so we don't have to iterate
                    # over an extra row
                    no_code_rows.drop(row_idx, inplace=True)
                    break
        return group
This works, but I have a feeling I am missing something. I don't think I should need two loops for this; maybe there is a pandas function for it?
Edit
We do not know the order or the number of rows in each group.
For example:
          name       code         id          country  yob
    1137  Bobby Joe  USA19921111  NaN         NaN      NaN
    2367  Bobby Joe  USA19981111  NaN         NaN      NaN
    4398  Bobby Joe  NaN          1223133121  USA      1992
Answer 0 (score: 0)
I think you need:
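    # Split the rows by whether they carry a code
    m = df['code'].isnull()
    df1 = df[~m]
    df2 = df[m]

    # Pair each coded row with the uncoded rows that share its name
    df = df1.merge(df2, on='name', suffixes=('', '_'))
    # Cast yob to int first so a float 1992.0 becomes '1992', not '1992.0'
    df['a_'] = df['country_'] + df['yob_'].astype(int).astype(str)
    # Keep only the pairs whose country + yob appears inside the code
    m = df.apply(lambda x: x['a_'] in x['code'], axis=1)
    df.loc[m, ['id', 'country', 'yob']] = df.loc[m, ['id_', 'country_', 'yob_']].rename(columns=lambda x: x.rstrip('_'))
    # Drop the helper columns
    df = df.loc[:, ~df.columns.str.endswith('_')]
    print(df)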
       name       code         id          country  yob
    0  Bobby Joe  USA19921111  1223133121  USA      1992
    1  Bobby Joe  USA19981111  NaN         NaN      NaN
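The merge idea can also be wrapped in a function that keeps the original index labels and does not depend on row order, which covers the case from the edit where the row without a code comes last. This is only a sketch under assumptions: `fill_from_uncoded` and the helper columns `row`, `donor` and `key` are hypothetical names rather than pandas API, `yob` is assumed to hold whole years, and rows without a code that match nothing are kept unchanged.

```python
import numpy as np
import pandas as pd


def fill_from_uncoded(df):
    """Hypothetical helper: copy id/country/yob from rows without a code
    into the same-name row whose code contains country + yob."""
    cols = ["id", "country", "yob"]
    has_code = df["code"].notna()
    coded = df[has_code].copy()
    # Donor rows: no code, but country and yob are both known.
    donors = df[~has_code].dropna(subset=["country", "yob"]).copy()
    donors["key"] = donors["country"] + donors["yob"].astype(int).astype(str)

    # Pair every coded row with every donor of the same name, carrying
    # both original index labels along as ordinary columns.
    pairs = coded.rename_axis("row").reset_index().merge(
        donors.rename_axis("donor").reset_index(),
        on="name",
        suffixes=("", "_"),
    )
    # Keep the pairs whose key is a substring of the code.
    hit = np.array(
        [k in c for k, c in zip(pairs["key"], pairs["code"])], dtype=bool
    )
    pairs = pairs.loc[hit].drop_duplicates("row").set_index("row")

    # Write the donor values back onto the matching coded rows.
    for c in cols:
        coded.loc[pairs.index, c] = pairs[c + "_"].to_numpy()

    # Donors that were used disappear; everything else is kept.
    leftover = df[~has_code].drop(index=pairs["donor"].unique())
    return pd.concat([coded, leftover])


# The group from the question's edit: the row without a code comes last.
df = pd.DataFrame(
    {
        "name": ["Bobby Joe"] * 3,
        "code": ["USA19921111", "USA19981111", np.nan],
        "id": [np.nan, np.nan, 1223133121],
        "country": [np.nan, np.nan, "USA"],
        "yob": [np.nan, np.nan, 1992],
    },
    index=[1137, 2367, 4398],
)
out = fill_from_uncoded(df)
print(out)  # row 1137 is filled in, row 2367 is unchanged, row 4398 is gone
```

Because the merge is on `name`, the function can be run on the whole table at once instead of once per group. If one coded row matches several donors, this sketch keeps the first match only.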