Question

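I have two datasets. Using the data in df1, I want to identify duplicate rows in df2 based on four conditions.

Conditions:

A row in the df1 'Name' column matches any row in the df2 'Name' column by more than 80%

(AND)

(df1['Class'] == df2['Class'] (OR) df1['Amt $'] == df2['Amt $'])

(AND)

A row in the df1 'Category' column matches any row in the df2 'Category' column by more than 80%

Result:

If all conditions are met, keep only the new data in df2 and delete the other rows.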
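The conditions above can be read as a single predicate on one pair of rows. The following is a minimal sketch, assuming `difflib.SequenceMatcher` as the similarity measure (the attempt in this question uses it) and hypothetical helper names `similarity` and `is_match`. The question does not say whether "more than 80%" is strict, so the sketch uses `>`.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return the SequenceMatcher ratio (0.0 to 1.0) of two strings."""
    return SequenceMatcher(None, a, b).ratio()


def is_match(row1, row2, threshold=0.8):
    """Hypothetical helper: True if row2 meets all conditions against row1.

    row1 and row2 can be dicts or pandas Series with the columns
    'Name', 'Class', 'Amt $' and 'Category'.
    """
    name_ok = similarity(row1['Name'], row2['Name']) > threshold
    class_or_amt_ok = (row1['Class'] == row2['Class']
                       or row1['Amt $'] == row2['Amt $'])
    category_ok = similarity(row1['Category'], row2['Category']) > threshold
    return name_ok and class_or_amt_ok and category_ok
```

Writing the test as one function per row pair avoids looping over each column independently, which would compare values taken from unrelated rows.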

df1

Name    Class   Amt $   Category
Apple      1    5       Fruit
Banana     2    8       Fruit
Cat        3    4       Animal

df2

Index   Name              Class Amt $   Category
    1   Apple is Red       1    5       Fruit
    2   Banana             2    8       fruits
    3   Cat is cute        3    4       animals
    4   Green Apple        1    5       fruis
    5   Banana is Yellow   2    8       fruet
    6   Cat                3    4       anemal
    7   Apple              1    5       anemal
    8   Ripe Banana        2    8       frut
    9   Royal Gala Apple   1    5       Fruit
    10  Cats               3    4       animol
    11  Green Banana       2    8       Fruit
    12  Green Apple        1    5       fruits
    13  White Cat          3    4       Animal
    14  Banana is sweet    2    8       appel
    15  Apple is Red       1    5       fruits
    16  Ginger Cat         3    4       fruits
    17  Cat house          3    4       animals
    18  Royal Gala Apple   1    5       fret
    19  Banana is Yellow   2    8       fruit market
    20  Cat is cute        3    4       anemal

The code I tried:

for i in df1['Name']:
    for u in df2['Name']:
        for k in df1['Class']:
            for l in df2['Class']:
                for m in df1['Amt $']:
                    for n in df2['Amt $']:
                        for o in df1['Category']:
                            for p in df2['Category']:
                                if SequenceMatcher(None, i, u).ratio() > .8 and k == l and m == n and SequenceMatcher(None, o, p).ratio() > 0.8:
                                    print(i, u)

The desired output dataframe should look like this:

Name              Class Amt $   Category
Apple is Red        1   5       Fruit
Banana              2   8       fruits
Cat is cute         3   4       animals
Green Apple         1   5       fruis
Banana is Yellow    2   8       fruet
Cat                 3   4       anemal
Ripe Banana         2   8       frut
Royal Gala Apple    1   5       Fruit
Cats                3   4       animol
Green Banana        2   8       Fruit
Green Apple         1   5       fruits
White Cat           3   4       Animal
Apple is Red        1   5       fruits
Cat house           3   4       animals
Banana is Yellow    2   8       fruit market
Cat is cute         3   4       anemal

Please help me find the best solution! :)

Answer 1

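First, iterate over both dataframes, test each pair of rows against the conditions, and record the result in a new column of df2.

from difflib import SequenceMatcher

df2['match'] = False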
for idx2, row2 in df2.iterrows():
    match = False
    for idx1, row1 in df1.iterrows():
        if (SequenceMatcher(None, row1['Name'], row2['Name']).ratio())>=0.8 and \
                (SequenceMatcher(None, row1['Category'], row2['Category']).ratio())>=0.8 and \
                (row1['Class'] == row2['Class'] or row1['Amt $'] == row2['Amt $']):
            match = True
            break
    df2.at[idx2, 'match'] = match

Once you have the matches, drop the duplicates among the matched rows (df2['match']==True).

df2[df2['match']==True].drop_duplicates(keep='first')
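As a small illustration of this step, here is a sketch on a made-up three-row frame (the data is illustrative, not the question's full df2). `drop_duplicates` compares whole rows, returns a new frame, and leaves the original untouched.

```python
import pandas as pd

# Illustrative matched rows; rows 0 and 1 are identical in every column
matched = pd.DataFrame({'Name': ['Banana', 'Banana', 'Cats'],
                        'Class': [2, 2, 3],
                        'match': [True, True, True]})

# keep='first' retains the first occurrence of each fully identical row
deduped = matched.drop_duplicates(keep='first')
```

Because the result is a new frame, it has to be assigned to a variable (or used directly in the next expression) to take effect.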

Next, combine the result above with the unmatched rows (df2['match']==False):

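# DataFrame.append was removed in pandas 2.0; pd.concat (with pandas imported as pd) gives the same result
pd.concat([df2[df2['match']==False], df2[df2['match']==True].drop_duplicates(keep='first')])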

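Here I am assuming you want to remove exact duplicates. Do you want to remove duplicates based on the conditions, or only exact duplicates?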

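According to the test dataset, 'Apple' and 'Apple is Red' are expected to match at 80%. But SequenceMatcher(None, 'Apple', 'Apple is Red').ratio() gives only 0.5882352941176471. Similarly, SequenceMatcher(None, 'Fruit', 'fruits').ratio() is only 0.7272727272727273. Do you have a different expectation here, or is the expected result incorrect?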
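These ratios can be checked directly. The sketch below reproduces the two values; the ratio is 2*M/T, where M is the number of matching characters and T is the combined length of both strings. The lowercased comparison at the end is an assumption added for illustration, not something the question asks for. It only shows that the comparison is case-sensitive.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """SequenceMatcher similarity of two strings, between 0.0 and 1.0."""
    return SequenceMatcher(None, a, b).ratio()


# 'Apple' (5 chars) fully matches inside 'Apple is Red' (12 chars): 2*5/17
name_ratio = ratio('Apple', 'Apple is Red')

# 'F' and 'f' differ, so only 'ruit' matches: 2*4/11
category_ratio = ratio('Fruit', 'fruits')

# Assumption for illustration: lowercase both sides first, giving 2*5/11
category_ratio_lower = ratio('Fruit'.lower(), 'fruits'.lower())
```

With a 0.8 threshold, only the lowercased category comparison would pass. The name comparison stays well below the threshold either way.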

Either way, I hope this gives you an idea of the approach.

Edit 1: If you want to get the matched df1['Name'].

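I just initialize df2['match'] as a string instead of a boolean, and assign df1['Name'] to df2['match'] instead of True. Then, for the final df, I concatenate the df2 rows where df2['match']==False with the non-duplicate rows where df2['match']==True. Hope this helps.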
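The code for this edit is not included in the answer, so the following is only a sketch of what it describes. It assumes an empty string marks "no match", assumes `pd.concat` for the final step, and uses tiny made-up frames in place of the question's data.

```python
from difflib import SequenceMatcher

import pandas as pd

# Tiny illustrative frames, not the full data from the question
df1 = pd.DataFrame({'Name': ['Banana', 'Cat'], 'Class': [2, 3],
                    'Amt $': [8, 4], 'Category': ['Fruit', 'Animal']})
df2 = pd.DataFrame({'Name': ['Banana', 'Banana', 'Cats', 'Green Apple'],
                    'Class': [2, 2, 3, 1], 'Amt $': [8, 8, 4, 5],
                    'Category': ['Fruit', 'Fruit', 'Animal', 'fruis']})

df2['match'] = ''  # assumption: empty string means "no match found"
for idx2, row2 in df2.iterrows():
    for idx1, row1 in df1.iterrows():
        if (SequenceMatcher(None, row1['Name'], row2['Name']).ratio() >= 0.8 and
                SequenceMatcher(None, row1['Category'], row2['Category']).ratio() >= 0.8 and
                (row1['Class'] == row2['Class'] or row1['Amt $'] == row2['Amt $'])):
            df2.at[idx2, 'match'] = row1['Name']  # store the matched df1 name
            break  # the first matching df1 row is enough

# Unmatched rows are kept as they are; matched rows lose exact duplicates
unmatched = df2[df2['match'] == '']
matched = df2[df2['match'] != ''].drop_duplicates(keep='first')
result = pd.concat([unmatched, matched])
```

In this sketch the two identical 'Banana' rows collapse into one, 'Cats' is kept with 'Cat' in the match column, and 'Green Apple' is kept as unmatched.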
