Question

我想比较两个数据帧，即df1和df2。 df1是一个每小时更新一次的数据。 df2是一个存在的数据帧。我想追加更新的特定行。

例如，这是df1

DF1：

fd1

包含5行信息

和已存在的df2

DF2：

df2

我们可以告诉我们添加了eric，但是df2并没有代表那个。

我可以用df1覆盖df2，但我不应该因为有人会在以后写入数据时更新。

所以，我决定通过它的id从df2中找到它来删除每一行数据，并用for循环删除它们

之后，将只剩下eric的行，这将使我可以将eric附加到df2。

所以我试过的是这个

for index, row in df1.iterrows():
    id = row['id']
    if df2.loc[df1['id'].isin(id)] = True:
        df1[df1.id != id)

并返回语法错误....

我正走在正确的轨道上吗？它是解决这个问题的最佳解决方案吗？我应该如何更改代码以实现我的目标？

Answer 1

修复你的代码...

l=[]
for index, row in df1.iterrows():
    id = row['Id']
    if sum(df2['Id'].isin([id]))>0:
        l.append(id)
l
Out[334]: [0, 1, 2, 3, 4] # those are the row you need to remove 

df1.loc[~df1.index.isin(l)]# you remove them by using `~` + .isin 
Out[339]: 
   Id Name
5   5    F
6   6    G

使用pd.concat

pd.concat([df2,df1[~df1.Id.isin(df2.Id)]],axis=0)
Out[337]: 
   Id Name
0   0    A
1   1    B
2   2    C
3   3    D
4   4    E
5   5    F
6   6    G

数据输入

fake = {'Id' : [0,1,2,3,4,5,6],
         'Name' : ['A','B','C','D','E','F','G']}
df1 = pd.DataFrame(fake)

fake = {'Id' : [0,1,2,3,4],
         'Name' : ['A','B','C','D','E']}
df2 = pd.DataFrame(fake)

Answer 2

我们假设'steve'有一条评论我们要保留在df1中，而'jack'有一条评论我们要保存在df2中。我们可以将每个数据框的索引设置为['id', 'name']并使用pd.Series.combine_first

设置

df1 = pd.DataFrame(dict(
    id=[12, 34, 56, 78, 90, 13],
    name='james steve jack ted eric bob'.split(),
    remark='',
))
df1.at[1, 'remark'] = 'meh'

df2 = pd.DataFrame(dict(
    id=[12, 34, 56, 78, 13],
    name='james steve jack ted bob'.split(),
    remark='',
))
df2.at[2, 'remark'] = 'smart'

解决方案

s1 = df1.set_index(['id', 'name']).remark
s2 = df2.set_index(['id', 'name']).remark

s1.mask(s1.eq('')).combine_first(s2.mask(s2.eq(''))).fillna('').reset_index()

   id   name remark
0  12  james       
1  13    bob       
2  34  steve    meh
3  56   jack  smart
4  78    ted       
5  90   eric

但是，假设它与OP完全一样！

设置

df1 = pd.DataFrame(dict(
    id=[12, 34, 56, 78, 90, 13],
    name='james steve jack ted eric bob'.split(),
    remark='',
))

df2 = pd.DataFrame(dict(
    id=[12, 34, 56, 78, 13],
    name='james steve jack ted bob'.split(),
    remark='',
))
df2.at[2, 'remark'] = 'smart'

解决方案

df2.append(df1).drop_duplicates(['id', 'name']).reset_index(drop=True)

   id   name remark
0  12  james       
1  34  steve       
2  56   jack  smart
3  78    ted       
4  13    bob       
5  90   eric

Answer 3

Pandas有几个可用的功能，允许合并和连接不同的DataFrame。您可以在此处使用的是merge：https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html

>>>merged = df1.merge(df2, how='left')
    id   name remark
0  234  james       
1  212  steve       
2  153   jack  smart
3  567    ted       
4  432   eric    NaN
5  543    bob

如果您不希望插入的值为NaN，则可以始终使用fillna：https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html。

python pandas dataframe查找包含特定值的行并返回布尔值

3 个答案: