Fill missing values in a DataFrame based on other cell values

Time: 2018-03-24 08:55:07

Tags: python pandas dataframe

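I have a large list of people that I am trying to deduplicate. I group the rows by name and merge the information where needed.

When no two people share a name there is no problem and the blanks can be filled in directly. If two people do share a name, however, some extra checks are needed.

Here is an example of one group: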

             name        code            id country     yob
1137  Bobby Joe   USA19921111           NaN     NaN     NaN
2367  Bobby Joe           NaN    1223133121     USA    1992
4398  Bobby Joe   USA19981111           NaN     NaN     NaN

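The code contains the person's country and date of birth. Looking at it, we can see that the first and second rows are the same person, so the information from the second row needs to be filled into the first: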

             name        code            id country     yob
1137  Bobby Joe   USA19921111    1223133121     USA    1992
4398  Bobby Joe   USA19981111           NaN     NaN     NaN
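
To make the matching rule concrete, here is a minimal sketch of how the country and birth year could be pulled out of `code`. The layout (three uppercase letters, a four-digit year, then four more digits) is an assumption inferred from the two example codes only, and the column names `code_country` and `code_yob` are made up for illustration.

```python
import pandas as pd

codes = pd.Series(
    ["USA19921111", None, "USA19981111"],
    index=[1137, 2367, 4398],
    name="code",
)

# Assumed layout (inferred from the example only): 3 uppercase letters for
# the country, a 4-digit birth year, then 4 more digits.
pattern = r"^(?P<code_country>[A-Z]{3})(?P<code_yob>\d{4})\d{4}$"

# Rows without a code (or with a code that does not fit the pattern)
# come back as missing values in both columns.
parts = codes.str.extract(pattern)
print(parts)
```

The extracted year is a string, so it would need converting before it is compared with a numeric `yob` column.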

This is what I have:

import pandas as pd


# Function wrapper added so the snippet runs; the name is illustrative
def consolidate(group):
    # Create a dictionary mapping the index of each row that has a code
    # to that code
    code_rows = group['code'].dropna().to_dict()
    # Copy so that dropping rows below does not act on a slice of `group`
    no_code_rows = group.loc[pd.isnull(group['code']), :].copy()

    if no_code_rows.empty or len(code_rows) == group.shape[0]:
        # No info to consolidate
        return group

    for group_idx, code in code_rows.items():
        for row_idx, row in no_code_rows.iterrows():
            country_yob = row['country'] + str(int(row['yob']))
            if country_yob in code:
                group.loc[group_idx, 'id'] = row['id']
                group.loc[group_idx, 'country'] = row['country']
                group.loc[group_idx, 'yob'] = row['yob']
                group.drop(row_idx, inplace=True)
                # Drop from temp table so we don't have to iterate
                # over an extra row
                no_code_rows.drop(row_idx, inplace=True)
                break

    return group

This works, but I have a feeling I am missing something. It seems I shouldn't need two loops for this; maybe there is a pandas function for it?

Edit

We don't know the order or the number of rows in each group:

           name         code            id country     yob
1137  Bobby Joe  USA19921111           NaN     NaN     NaN
2367  Bobby Joe  USA19981111           NaN     NaN     NaN
4398  Bobby Joe          NaN    1223133121     USA    1992

1 answer:

Answer 0 (score: 0)

I think you need:

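# Split rows that have a code from rows that do not
m = df['code'].isnull()
df1 = df[~m]
df2 = df[m]

# Pair every coded row with every uncoded row of the same name
df = df1.merge(df2, on='name', suffixes=('', '_'))
# Build country + year; strip a trailing ".0" in case yob is stored as float
df['a_'] = df['country_'] + df['yob_'].astype(str).str.replace(r'\.0$', '', regex=True)
m = df.apply(lambda x: x['a_'] in x['code'], axis=1)
# Copy the matching values over, then drop the helper columns
df.loc[m, ['id', 'country', 'yob']] = df.loc[m, ['id_', 'country_', 'yob_']].rename(columns=lambda x: x.strip('_'))
df = df.loc[:, ~df.columns.str.endswith('_')]
print(df)

        name         code          id country   yob
0  Bobby Joe  USA19921111  1223133121     USA  1992
1  Bobby Joe  USA19981111         NaN     NaN   NaN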
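
As a possible alternative to the merge, here is a hedged sketch that builds one matching key per row and lets `groupby(...).first()` pick the first non-null value in each column. This is not the method from the answer above. It assumes the first seven characters of `code` are always country plus birth year, which is stricter than the substring test in the question, and the names `consolidate` and `_key` are made up for illustration. Rows that have neither a code nor both `country` and `yob` are passed through unchanged.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Bobby Joe"] * 3,
        "code": ["USA19921111", "USA19981111", None],
        "id": [None, None, "1223133121"],
        "country": [None, None, "USA"],
        "yob": [None, None, 1992],
    },
    index=[1137, 2367, 4398],
)


def consolidate(df):
    # Key from the code: first 7 characters (assumed country + year).
    key = df["code"].str[:7]

    # For rows without a code, build the same key from country and yob.
    fill = key.isna() & df["country"].notna() & df["yob"].notna()
    key[fill] = df.loc[fill, "country"] + df.loc[fill, "yob"].astype(int).astype(str)

    keyed = df.assign(_key=key)
    # first() keeps the first non-null value per column within each group,
    # so a coded row and its matching uncoded row collapse into one row.
    merged = (
        keyed[keyed["_key"].notna()]
        .groupby(["name", "_key"], sort=False, as_index=False)
        .first()
        .drop(columns="_key")
    )

    # Rows with no usable key cannot be matched; keep them as they are.
    leftovers = df[key.isna()]
    if leftovers.empty:
        return merged
    return pd.concat([merged, leftovers], ignore_index=True)


result = consolidate(df)
print(result)
```

Because the key is built per row, the order of rows inside a group does not matter. If several uncoded rows match the same code, the first non-null value in each column wins, which may or may not be the desired tie-break.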