Conditionally add missing rows from one DataFrame to another

Time: 2020-07-03 09:10:54

Tags: python pandas dataframe

My sample data looks like this:

import pandas as pd

data1 = {'index': ['001', '001', '001', '002', '002', '003', '004', '004'],
         'type': ['red', 'red', 'red', 'yellow', 'red', 'green', 'blue', 'blue'],
         'class': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A']}
df1 = pd.DataFrame(data1, columns=['index', 'type', 'class'])
df1
    index   type    class
0   001     red     A
1   001     red     A
2   001     red     A
3   002     yellow  A
4   002     red     A
5   003     green   A
6   004     blue    A
7   004     blue    A

data2 = {'index': ['001', '001', '002', '003', '004'],
         'type': ['red', 'red', 'yellow', 'green', 'blue'],
         'class': ['A', 'A', 'A', 'B', 'A'],
         'outcome': ['in', 'in', 'out', 'in', 'out']}
df2 = pd.DataFrame(data2, columns=['index', 'type', 'class', 'outcome'])
df2
    index   type    class   outcome
0   001     red     A       in
1   001     red     A       in
2   002     yellow  A       out
3   003     green   B       in
4   004     blue    A       out

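In df1, class is always A; in df2 it can be A, B, or C. I want to add to df2 the rows of df1 that are missing from it. df1 holds the count of each type per index. For example, if index 001 appears 3 times in df1, it should also appear 3 times in df2. For rows that are in df1 but not in df2, the outcome column should be NaN. The output should be: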

    index   type    class   outcome
0   001     red     A       in
1   001     red     A       in
2   001     red     A       NaN
3   002     yellow  A       out
4   002     red     A       NaN
5   003     green   A       NaN
6   003     green   B       in
7   004     blue    A       out
8   004     blue    A       NaN

I tried pd.concat and pd.merge, but I keep getting duplicates or the wrong rows added. Does anyone have an idea how to do this?
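
For reference, the duplication described above can be reproduced with a plain outer merge on the three shared columns. This is a minimal sketch using the sample data from the question; the variable names `naive` and `n_001` are only illustrative. Because ('001', 'red', 'A') appears three times in df1 and twice in df2, the merge pairs every copy on one side with every copy on the other:

```python
import pandas as pd

df1 = pd.DataFrame({
    'index': ['001', '001', '001', '002', '002', '003', '004', '004'],
    'type': ['red', 'red', 'red', 'yellow', 'red', 'green', 'blue', 'blue'],
    'class': ['A'] * 8,
})
df2 = pd.DataFrame({
    'index': ['001', '001', '002', '003', '004'],
    'type': ['red', 'red', 'yellow', 'green', 'blue'],
    'class': ['A', 'A', 'A', 'B', 'A'],
    'outcome': ['in', 'in', 'out', 'in', 'out'],
})

# Duplicate keys are matched many-to-many: 3 x 2 = 6 rows for ('001', 'red', 'A').
naive = df1.merge(df2, on=['index', 'type', 'class'], how='outer')
n_001 = int((naive['index'] == '001').sum())
print(len(naive), n_001)  # 12 rows in total, 6 of them for index 001
```

The desired output has 9 rows, so the repeated keys need something extra to tell their copies apart.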

2 Answers:

Answer 0 (score: 1)

Use GroupBy.cumcount to build a counter that makes repeated rows unique, so that an outer join with DataFrame.merge can be used in the next step:

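# Number repeated (index, type, class) rows 0, 1, 2, ... within each frame
df1['group'] = df1.groupby(['index','type','class']).cumcount()
df2['group'] = df2.groupby(['index','type','class']).cumcount()

# Outer join on the keys plus the counter, then drop the helper column
df = (df1.merge(df2, on=['index','type','class','group'], how='outer')
         .sort_values(by=['index', 'class'])
         .drop(columns='group'))
print(df)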
  index    type class outcome
0   001     red     A      in
1   001     red     A      in
2   001     red     A     NaN
3   002  yellow     A     out
4   002     red     A     NaN
5   003   green     A     NaN
8   003   green     B      in
6   004    blue     A     out
7   004    blue     A     NaN
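
The same idea can be wrapped in a small helper that leaves the input frames unmodified. This is only a sketch under stated assumptions, not part of the answer itself: the function name `add_missing_rows` and the temporary column `_n` are made up for illustration, and the data is the sample from the question.

```python
import pandas as pd

def add_missing_rows(left, right, keys):
    """Outer-merge two frames, pairing the n-th copy of a key in `left`
    with the n-th copy of the same key in `right`."""
    # assign() returns new frames, so the callers' frames are not modified.
    left = left.assign(_n=left.groupby(keys).cumcount())
    right = right.assign(_n=right.groupby(keys).cumcount())
    return (left.merge(right, on=keys + ['_n'], how='outer')
                .drop(columns='_n')
                .reset_index(drop=True))

df1 = pd.DataFrame({
    'index': ['001', '001', '001', '002', '002', '003', '004', '004'],
    'type': ['red', 'red', 'red', 'yellow', 'red', 'green', 'blue', 'blue'],
    'class': ['A'] * 8,
})
df2 = pd.DataFrame({
    'index': ['001', '001', '002', '003', '004'],
    'type': ['red', 'red', 'yellow', 'green', 'blue'],
    'class': ['A', 'A', 'A', 'B', 'A'],
    'outcome': ['in', 'in', 'out', 'in', 'out'],
})

result = (add_missing_rows(df1, df2, ['index', 'type', 'class'])
          .sort_values(['index', 'class'])
          .reset_index(drop=True))
print(result)
```

The row order among ties may differ between pandas versions, so it is safer to check the result by counts (9 rows, 4 of them with a missing outcome) than by position.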

Answer 1 (score: 1)

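# Number the rows within each index; matching rows must sit at the same
# position within their index group in both frames
df1['index_id'] = df1.groupby('index').cumcount()
df2['index_id'] = df2.groupby('index').cumcount()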

merged = (
    df2
    .merge(df1, how='outer', on=['index', 'type', 'class', 'index_id'])
    .sort_values(by=['index', 'class'])
    .reset_index(drop=True)
    .drop(columns='index_id')
)

print(merged)
    index   type  class outcome
0   001     red    A    in
1   001     red    A    in
2   001     red    A    NaN
3   002     yellow A    out
4   002     red    A    NaN
5   003     green  A    NaN
6   003     green  B    in
7   004     blue   A    out
8   004     blue   A    NaN
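
One thing to watch for, shown here as a hedged sketch with made-up data (the frames `a` and `b` and the helper `merge_with_counter` are not from the question): when the counter is built from `index` alone, a row's number depends on its position within the index group, so the same key can receive different numbers in the two frames and fail to match. Numbering by all three key columns avoids this.

```python
import pandas as pd

a = pd.DataFrame({'index': ['002', '002'],
                  'type': ['yellow', 'red'],
                  'class': ['A', 'A']})
b = pd.DataFrame({'index': ['002'], 'type': ['red'],
                  'class': ['A'], 'outcome': ['out']})

def merge_with_counter(left, right, counter_keys):
    """Outer-merge on the three key columns plus a cumcount counter
    computed over `counter_keys`."""
    keys = ['index', 'type', 'class']
    left = left.assign(n=left.groupby(counter_keys).cumcount())
    right = right.assign(n=right.groupby(counter_keys).cumcount())
    return left.merge(right, on=keys + ['n'], how='outer').drop(columns='n')

# Counter per index: 'red' is numbered 1 in `a` but 0 in `b`, so nothing matches.
by_index = merge_with_counter(a, b, ['index'])
# Counter per full key: 'red' is numbered 0 in both frames and matches.
by_key = merge_with_counter(a, b, ['index', 'type', 'class'])
print(len(by_index), len(by_key))  # 3 2
```

With the sample data in the question both counters happen to give the same result, because the rows for each index appear in the same order in df1 and df2.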