Question

I have two DataFrames, df and df1.

df is shown below:

Facility    Category  ID     Part  Text
Centennial  History   11111  A     Drain
Centennial  History   11111  B     Read
Centennial  History   11111  C     EKG
Centennial  History   11111  D     Assistant
Centennial  History   11111  E     Primary

df1 is shown below (just a small sample for the question; the real data has 50,000 rows):

Facility    Category  ID      Part   Text
Centennial  History  11111    D      Assistant

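Basically, I want to compare rows between the two DataFrames and, where a row appears in both, flag it in a new column named 'MatchingFlag' in the first DataFrame, df.

I'd like the final DataFrame to look like the one below, since I want to focus on the rows that don't match.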

Facility    Category  ID    Part    Text      MatchingFlag
Centennial  History  11111  A     Drain         No
Centennial  History  11111  B     Read          No
Centennial  History  11111  C     EKG           No
Centennial  History  11111  D     Assistant     Yes
Centennial  History  11111  E     Primary       No

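Any help on how to do this? I tried merging the two DataFrames with df = pd.merge(df1, df, how='left', on=['Facility', 'Category', 'ID', 'Part', 'Text']) and then creating a flag based on blank or NaN values, but that didn't work out the way I hoped.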

Answer 1

It may make sense to set the index on the columns you want to match on, and then use that index to work out which rows match.

columns = ['Facility', 'Category', 'ID', 'Part', 'Text']

# It's always a good idea to sort after creating a MultiIndex like this.
# sort_index() replaces sortlevel(), which newer pandas versions no longer provide.
df = df.set_index(columns).sort_index()
df1 = df1.set_index(columns).sort_index()

# You don't have to use "Yes" here, anything will do.
# The boolean True might be more appropriate.
df['MatchingFlag'] = "Yes"
df1['MatchingFlag'] = "Yes"

# Add them together: pandas aligns on the index, so matching rows
# will have the value "YesYes" and non-matches will be NaN.
result = df + df1

# If you'd rather not have NaNs
result['MatchingFlag'] = result['MatchingFlag'].replace('YesYes', 'Yes').fillna('No')

# Optional: turn the index levels back into ordinary columns
result = result.reset_index()
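For comparison, here is a sketch of a different route that is not part of the answer above: a left merge from df onto df1 using the `indicator=True` option of `merge`, which adds a `_merge` column saying where each row was found. The sample frames are rebuilt from the question so the snippet runs on its own; the names `keys` and `merged` are illustrative only, and the `drop_duplicates()` step assumes you want at most one flag per row of df.

```python
import pandas as pd

columns = ['Facility', 'Category', 'ID', 'Part', 'Text']

# Sample data rebuilt from the question
df = pd.DataFrame({
    'Facility': ['Centennial'] * 5,
    'Category': ['History'] * 5,
    'ID': [11111] * 5,
    'Part': ['A', 'B', 'C', 'D', 'E'],
    'Text': ['Drain', 'Read', 'EKG', 'Assistant', 'Primary'],
})
df1 = pd.DataFrame({
    'Facility': ['Centennial'],
    'Category': ['History'],
    'ID': [11111],
    'Part': ['D'],
    'Text': ['Assistant'],
})

# Drop duplicate keys in df1 so the left merge cannot multiply rows of df
keys = df1[columns].drop_duplicates()

# indicator=True adds a '_merge' column: 'both' for rows found in df1,
# 'left_only' for rows that exist only in df
merged = df.merge(keys, how='left', on=columns, indicator=True)

# Translate the indicator into the Yes/No flag and drop the helper column
merged['MatchingFlag'] = (merged['_merge'] == 'both').map({True: 'Yes', False: 'No'})
result = merged.drop(columns='_merge')

print(result)
```

Note that df is the left frame here, so every row of df is kept in its original order. The attempt in the question put df1 on the left, which keeps only df1's rows.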

Pandas: compare two DataFrames and flag the rows that match
