I have been searching for an answer but could not find one. I have two dataframes, one called target and the other backup, and both have the same columns. What I want to do is look at a given column and add to target all rows from backup that are not already in target. The most straightforward solution is:
import pandas as pd
import numpy as np

target = pd.DataFrame({
    "key1": ["K1", "K2", "K3", "K5"],
    "A": ["A1", "A2", "A3", np.nan],
    "B": ["B1", "B2", "B3", "B5"],
})
backup = pd.DataFrame({
    "key1": ["K1", "K2", "K3", "K4", "K5"],
    "A": ["A1", "A", "A3", "A4", "A5"],
    "B": ["B1", "B2", "B3", "B4", "B5"],
})

merged = target.copy()
for item in backup.key1.unique():
    if item not in target.key1.unique():
        merged = pd.concat([merged, backup.loc[backup.key1 == item]])
merged.reset_index(drop=True, inplace=True)
which gives
key1 A B
0 K1 A1 B1
1 K2 A2 B2
2 K3 A3 B3
3 K5 NaN B5
4 K4 A4 B4
Now I have tried a few pandas-only approaches, but none of them work.
# Does not work: it creates duplicate rows, and if those are dropped,
# the updated rows that differ are still kept -- compare the rows with A vs NaN
pd.concat([target, backup]).drop_duplicates()
key1 A B
0 K1 A1 B1
1 K2 A2 B2
2 K3 A3 B3
3 K5 NaN B5
1 K2 A B2
3 K4 A4 B4
4 K5 A5 B5
# Does not work because backup would overwrite data in target -- note the NaN is gone
pd.merge(target, backup, how="right")
key1 A B
0 K1 A1 B1
1 K2 A B2
2 K3 A3 B3
3 K4 A4 B4
4 K5 A5 B5
Importantly, this is not a duplicate of this post, because I do not want a new column, and more importantly, the NaN values are not values in target -- they simply do not exist there. Moreover, if I used the suggested approach of combining columns, the NaN in target would be replaced by unwanted values from backup.
It is also not a duplicate of this post using pandas' combine_first, because in that case the NaN is filled with a value from backup, which is wrong:
target.combine_first(backup)
key1 A B
0 K1 A1 B1
1 K2 A2 B2
2 K3 A3 B3
3 K5 A4 B5
4 K5 A5 B5
target.join(backup, on=["key1"])
gives me
ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat
which I really do not understand, since both columns are plain strings, and the proposed solution does not work.
So I would like to ask: what am I missing? How can I do this with some pandas method? Many thanks.
Answer 0 (score: 2)
Use concat with backup rows filtered by boolean indexing with Series.isin, keeping only keys not present in target.key1:
merged = pd.concat([target, backup[~backup.key1.isin(target.key1)]])
print(merged)
key1 A B
0 K1 A1 B1
1 K2 A2 B2
2 K3 A3 B3
3 K5 NaN B5
3 K4 A4 B4
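An equivalent way to express the same filter is an anti-join via merge with indicator=True; a sketch, reusing the target and backup frames defined in the question:

```python
import pandas as pd
import numpy as np

target = pd.DataFrame({
    "key1": ["K1", "K2", "K3", "K5"],
    "A": ["A1", "A2", "A3", np.nan],
    "B": ["B1", "B2", "B3", "B5"],
})
backup = pd.DataFrame({
    "key1": ["K1", "K2", "K3", "K4", "K5"],
    "A": ["A1", "A", "A3", "A4", "A5"],
    "B": ["B1", "B2", "B3", "B4", "B5"],
})

# Left-merge backup against target's keys; the "_merge" column marks
# rows that exist only in backup as "left_only"
extra = backup.merge(target[["key1"]], on="key1", how="left", indicator=True)
extra = extra.loc[extra["_merge"] == "left_only"].drop(columns="_merge")
merged = pd.concat([target, extra], ignore_index=True)
print(merged)
```

Both versions produce the same rows; the isin one-liner above is shorter, while the indicator form can be handy when you also want to inspect which keys matched.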
Answer 1 (score: 0)
Maybe you can try this using the subset parameter of df.drop_duplicates()?
pd.concat([target, backup]).drop_duplicates(subset = "key1")
which gives the output:
key1 A B
0 K1 A1 B1
1 K2 A2 B2
2 K3 A3 B3
3 K5 NaN B5
3 K4 A4 B4
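A note on why this works: drop_duplicates keeps the first occurrence of each key by default (keep="first"), so because target is listed first in the concat, its rows win over backup's even when the other columns differ. A minimal check, reusing the frames from the question:

```python
import pandas as pd
import numpy as np

target = pd.DataFrame({
    "key1": ["K1", "K2", "K3", "K5"],
    "A": ["A1", "A2", "A3", np.nan],
    "B": ["B1", "B2", "B3", "B5"],
})
backup = pd.DataFrame({
    "key1": ["K1", "K2", "K3", "K4", "K5"],
    "A": ["A1", "A", "A3", "A4", "A5"],
    "B": ["B1", "B2", "B3", "B4", "B5"],
})

# target comes first, so its version of each duplicated key survives;
# only K4, which is missing from target, is taken from backup
merged = (pd.concat([target, backup])
            .drop_duplicates(subset="key1")
            .reset_index(drop=True))
print(merged)
```

In particular, the NaN in target's K5 row is preserved rather than being overwritten by backup's A5, which is exactly the behavior the question asks for.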