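I have two datasets. Using the data in df1, I want to identify duplicate rows in df2 based on 4 conditions:
If a row in the df1 'Name' column matches any row in the df2 'Name' column by more than 80%
(AND)
(df1['Class'] == df2['Class'] (OR) df1['Amt $'] == df2['Amt $'])
(AND)
If a row in the df1 'Category' column matches any row in the df2 'Category' column by more than 80%
If all the conditions are met, keep only the new data in df2 and delete the other rows.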
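A minimal sketch of how the conditions above could be combined into one predicate, assuming "matches by more than 80%" means difflib's `SequenceMatcher.ratio()` is above 0.8. The helper names `similarity` and `is_duplicate` are hypothetical, and each row is assumed to be a mapping keyed by the column names.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.8  # assumption: "more than 80%" maps to ratio() > 0.8


def similarity(a, b):
    """Return difflib's similarity ratio for two strings (0.0 to 1.0)."""
    return SequenceMatcher(None, a, b).ratio()


def is_duplicate(row1, row2, threshold=THRESHOLD):
    """Check one df1 row against one df2 row.

    Both rows are assumed to be mappings with the keys
    'Name', 'Class', 'Amt $' and 'Category'.
    """
    name_ok = similarity(row1['Name'], row2['Name']) > threshold
    class_or_amt_ok = (row1['Class'] == row2['Class']
                       or row1['Amt $'] == row2['Amt $'])
    category_ok = similarity(row1['Category'], row2['Category']) > threshold
    return name_ok and class_or_amt_ok and category_ok
```

Note that the comparison is case-sensitive, so 'Fruit' and 'fruits' score lower than they would after lower-casing.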
df1
Name Class Amt $ Category
Apple 1 5 Fruit
Banana 2 8 Fruit
Cat 3 4 Animal
df2
Index Name Class Amt $ Category
1 Apple is Red 1 5 Fruit
2 Banana 2 8 fruits
3 Cat is cute 3 4 animals
4 Green Apple 1 5 fruis
5 Banana is Yellow 2 8 fruet
6 Cat 3 4 anemal
7 Apple 1 5 anemal
8 Ripe Banana 2 8 frut
9 Royal Gala Apple 1 5 Fruit
10 Cats 3 4 animol
11 Green Banana 2 8 Fruit
12 Green Apple 1 5 fruits
13 White Cat 3 4 Animal
14 Banana is sweet 2 8 appel
15 Apple is Red 1 5 fruits
16 Ginger Cat 3 4 fruits
17 Cat house 3 4 animals
18 Royal Gala Apple 1 5 fret
19 Banana is Yellow 2 8 fruit market
20 Cat is cute 3 4 anemal
from difflib import SequenceMatcher

# My attempt: nested loops over every column of both frames
for i in df1['Name']:
    for u in df2['Name']:
        for k in df1['Class']:
            for l in df2['Class']:
                for m in df1['Amt $']:
                    for n in df2['Amt $']:
                        for o in df1['Category']:
                            for p in df2['Category']:
                                if SequenceMatcher(None, i, u).ratio() > .8 and k == l and m == n and SequenceMatcher(None, o, p).ratio() > 0.8:
                                    print(i, u)
The desired output dataframe should look like this:
Name Class Amt $ Category
Apple is Red 1 5 Fruit
Banana 2 8 fruits
Cat is cute 3 4 animals
Green Apple 1 5 fruis
Banana is Yellow 2 8 fruet
Cat 3 4 anemal
Ripe Banana 2 8 frut
Royal Gala Apple 1 5 Fruit
Cats 3 4 animol
Green Banana 2 8 Fruit
Green Apple 1 5 fruits
White Cat 3 4 Animal
Apple is Red 1 5 fruits
Cat house 3 4 animals
Banana is Yellow 2 8 fruit market
Cat is cute 3 4 anemal
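Please help me with the best solution! :)
Answer 0 (score: 1)
First, you have to loop through both dataframes, test each pair of rows against the conditions, and record the result in a new column in df2.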
from difflib import SequenceMatcher

df2['match'] = False
for idx2, row2 in df2.iterrows():
    match = False
    for idx1, row1 in df1.iterrows():
        # Name and Category must be at least 80% similar,
        # and either Class or Amt $ must be equal
        if (SequenceMatcher(None, row1['Name'], row2['Name']).ratio()) >= 0.8 and \
           (SequenceMatcher(None, row1['Category'], row2['Category']).ratio()) >= 0.8 and \
           (row1['Class'] == row2['Class'] or row1['Amt $'] == row2['Amt $']):
            match = True
            break
    df2.at[idx2, 'match'] = match
Once you have the matches, drop the duplicates from the matched rows, i.e. those where df2['match']==True.
df2[df2['match']==True].drop_duplicates(keep='first')
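Next, you can concatenate the above result with the non-matching rows, df2['match']==False. This uses pd.concat because DataFrame.append is no longer available in pandas 2.0 and later.
pd.concat([df2[df2['match']==False], df2[df2['match']==True].drop_duplicates(keep='first')])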
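Putting the steps above together, here is a minimal end-to-end sketch. The sample frames are small hypothetical stand-ins, not the data from the question, and the helper names `ratio` and `matches_any` are made up for illustration. The final `sort_index()` and `drop(columns='match')` are optional extras that restore the original row order and remove the helper column.

```python
from difflib import SequenceMatcher

import pandas as pd

# Hypothetical sample data, much smaller than the frames in the question.
df1 = pd.DataFrame({
    'Name': ['Apple', 'Banana'],
    'Class': [1, 2],
    'Amt $': [5, 8],
    'Category': ['Fruit', 'Fruit'],
})
df2 = pd.DataFrame({
    'Name': ['Apple', 'Apple', 'Green Apple', 'Banana', 'Cat'],
    'Class': [1, 1, 1, 2, 3],
    'Amt $': [5, 5, 5, 8, 4],
    'Category': ['Fruit', 'Fruit', 'fruis', 'Fruits', 'Animal'],
})


def ratio(a, b):
    """Similarity ratio of two strings according to difflib."""
    return SequenceMatcher(None, a, b).ratio()


def matches_any(row2, df1, threshold=0.8):
    """Return True if row2 satisfies the conditions against any df1 row."""
    for _, row1 in df1.iterrows():
        if (ratio(row1['Name'], row2['Name']) >= threshold
                and ratio(row1['Category'], row2['Category']) >= threshold
                and (row1['Class'] == row2['Class']
                     or row1['Amt $'] == row2['Amt $'])):
            return True
    return False


# Flag every df2 row that matches some df1 row.
df2['match'] = [matches_any(row2, df1) for _, row2 in df2.iterrows()]

# Drop exact duplicates among the matched rows only.
matched = df2[df2['match']].drop_duplicates(keep='first')
unmatched = df2[~df2['match']]

# Recombine, restore the original order, and drop the helper column.
result = pd.concat([unmatched, matched]).sort_index().drop(columns='match')
print(result)
```

Here the second 'Apple' row is removed as an exact duplicate of the first, while 'Green Apple' and 'Cat' are kept because they never matched a df1 row.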
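Here I am assuming that you want to drop exact duplicates. Do you want to drop duplicates based on the conditions, or only exact duplicates?
According to the test dataset, 'Apple' and 'Apple is Red' are expected to match at 80% or more. However, SequenceMatcher(None, 'Apple', 'Apple is Red').ratio()
gives only 0.5882352941176471. Similarly, SequenceMatcher(None, 'Fruit', 'fruits').ratio()
gives only 0.7272727272727273. Are you expecting something else here, or is the expected result incorrect?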
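The two ratios can be checked directly, as in the short sketch below. The lower-cased comparison at the end is only an assumption about what might have been intended, not something the question states, and the helper name `ratio` is hypothetical.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Similarity ratio: 2 * matched characters / total characters."""
    return SequenceMatcher(None, a, b).ratio()


# 5 matched characters out of 5 + 12, so 10/17.
name_ratio = ratio('Apple', 'Apple is Red')
print(name_ratio)      # 0.5882352941176471

# Only 'ruit' matches because 'F' != 'f', so 8/11.
category_ratio = ratio('Fruit', 'fruits')
print(category_ratio)  # 0.7272727272727273

# Assumption: ignoring case raises the category score above 0.8 (10/11),
# but the name comparison would still fall short of the threshold.
lowered_ratio = ratio('Fruit'.lower(), 'fruits'.lower())
print(lowered_ratio)
```

So a case-insensitive comparison could explain the category matches, but not 'Apple' against 'Apple is Red'.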
Anyway, I hope this gives you an idea of the approach.
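Edit 1: If you want to get the matching df1['Name'].
I just reset df2['match'] to be a string instead of a boolean, and assign df1['Name'] to df2['match'] rather than assigning True. Then, in the final dataframe, I concatenate the rows of df2 having df2['match']==False with the non-duplicate rows having df2['match']==True. Hope this helps.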
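The code for this edit did not survive, so the following is only a sketch of what it describes. It assumes an empty string marks "no match", which is a choice made here rather than something the answer confirms. The sample frames and the helper names `ratio` and `first_match_name` are hypothetical.

```python
from difflib import SequenceMatcher

import pandas as pd

# Hypothetical sample data for illustration only.
df1 = pd.DataFrame({
    'Name': ['Apple', 'Cat'],
    'Class': [1, 3],
    'Amt $': [5, 4],
    'Category': ['Fruit', 'Animal'],
})
df2 = pd.DataFrame({
    'Name': ['Apple', 'Cats', 'Cats', 'White Cat'],
    'Class': [1, 3, 3, 3],
    'Amt $': [5, 4, 4, 4],
    'Category': ['Fruit', 'Animal', 'Animal', 'Animal'],
})


def ratio(a, b):
    """Similarity ratio of two strings according to difflib."""
    return SequenceMatcher(None, a, b).ratio()


def first_match_name(row2, df1, threshold=0.8):
    """Return the 'Name' of the first df1 row that row2 matches, else ''."""
    for _, row1 in df1.iterrows():
        if (ratio(row1['Name'], row2['Name']) >= threshold
                and ratio(row1['Category'], row2['Category']) >= threshold
                and (row1['Class'] == row2['Class']
                     or row1['Amt $'] == row2['Amt $'])):
            return row1['Name']
    return ''  # assumption: empty string means "no match"


# Store the matched df1 name instead of a boolean flag.
df2['match'] = [first_match_name(row2, df1) for _, row2 in df2.iterrows()]

has_match = df2['match'] != ''
result = pd.concat([
    df2[~has_match],                                # rows with no match
    df2[has_match].drop_duplicates(keep='first'),   # de-duplicated matches
])
print(result)
```

With the matched name stored in the column, each kept row also shows which df1 entry it was matched to.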