Question

I have the following DataFrame:

Column1         Column2
tomato fruit    tomatoes are not a fruit
potato la best  potatoe are some sort of fruit
apple           there are great benefits to appel
pear            peer

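I'd like to compare the word/sentence on the left with the sentence on the right, and if it matches up to the last two words (for example 'potato la', leaving out 'best'), the row should get a score.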

I have tried two different approaches:

from difflib import SequenceMatcher as SM

for i in range(len(Column1)):
    store_it = SM(None, Column1[i], Column2[i]).get_matching_blocks()
    print(store_it)

and

df['diff'] = df.apply(lambda x: diff.SequenceMatcher(None, x[0].strip(), x[1].strip()).ratio(), axis=1)

which I found on the internet.

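The second one works fine, but it tries to match the whole phrase. How can I match the words in the first column against the words in the second column, so that the end result is a "yes" if they appear in the sentence (or part of it) and a "no" if they do not?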

Answer 1

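I have had the most success with FuzzyWuzzy's partial ratio for this kind of problem. It gives you a partial-match percentage between Column1 'tomato fruit' and Column2 'tomatoes are not a fruit', and likewise for the remaining rows. See the results: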

from fuzzywuzzy import fuzz
import difflib

df['fuzz_partial_ratio'] = df.apply(lambda x: fuzz.partial_ratio(x['Column1'], x['Column2']), axis=1)

df['sequence_ratio'] = df.apply(lambda x: difflib.SequenceMatcher(None, x['Column1'], x['Column2']).ratio(), axis=1)

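You can treat any FuzzyWuzzy score > 60 as a good partial match, i.e. yes, the words in Column1 most likely appear in the sentence in Column2.

Row 1 scores 67, row 2 scores 71, row 3 scores 80, and row 4 scores 75.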
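If installing FuzzyWuzzy is not an option, the same threshold idea can be sketched with the standard library alone. The `partial_ratio` below is a simplified stand-in written for illustration (slide the shorter string across the longer one and keep the best `SequenceMatcher` ratio); it is not FuzzyWuzzy's exact algorithm, so its scores may differ from the ones quoted above. The helper name `is_partial_match` and the sample `rows` list are assumptions made for this example.

```python
from difflib import SequenceMatcher


def partial_ratio(a, b):
    """Best ratio (0-100) of the shorter string against any
    same-length window of the longer string.

    Simplified illustration only, not FuzzyWuzzy's implementation.
    """
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    if not short:
        return 0
    best = 0.0
    for start in range(len(long_) - len(short) + 1):
        window = long_[start:start + len(short)]
        best = max(best, SequenceMatcher(None, short, window).ratio())
    return round(100 * best)


def is_partial_match(a, b, threshold=60):
    """Turn the score into the yes/no answer the question asks for."""
    return "yes" if partial_ratio(a, b) > threshold else "no"


# Sample rows copied from the question's DataFrame.
rows = [
    ("tomato fruit", "tomatoes are not a fruit"),
    ("potato la best", "potatoe are some sort of fruit"),
    ("apple", "there are great benefits to appel"),
    ("pear", "peer"),
]

for left, right in rows:
    print(left, "|", partial_ratio(left, right), "|", is_partial_match(left, right))
```

The threshold of 60 is taken from the answer above; it is worth tuning on your own data, since this stand-in scores differently from FuzzyWuzzy.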

Answer 2

Use set():

Python » Documentation
  issubset(other)
  set <= other
      Test whether every element in the set is in other.

For example:

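# split() so the sets hold words; set() on a bare string would hold characters
c_set1 = set(Column1[i].split())
c_set2 = set(Column2[i].split())
if c_set1.issubset(c_set2):
    # every word in c_set1 is in c_set2
    print("yes")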
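An exact subset test will miss near-misses such as 'tomato' vs 'tomatoes'. One possible way to relax it, sketched here with `difflib.get_close_matches` from the standard library, is to accept a word when some word in the sentence is close enough to it. The function names `words_found` and `all_words_match` and the `cutoff` of 0.8 are assumptions for this example, not part of the answer above.

```python
from difflib import get_close_matches


def words_found(phrase, sentence, cutoff=0.8):
    """Return the words of `phrase` that have a close match
    among the words of `sentence` (case-insensitive)."""
    sentence_words = sentence.lower().split()
    return [
        word
        for word in phrase.lower().split()
        if get_close_matches(word, sentence_words, n=1, cutoff=cutoff)
    ]


def all_words_match(phrase, sentence, cutoff=0.8):
    """True only if every word of `phrase` was found in `sentence`."""
    words = phrase.split()
    return bool(words) and len(words_found(phrase, sentence, cutoff)) == len(words)


# Sample rows copied from the question's DataFrame.
rows = [
    ("tomato fruit", "tomatoes are not a fruit"),
    ("potato la best", "potatoe are some sort of fruit"),
    ("apple", "there are great benefits to appel"),
    ("pear", "peer"),
]

for left, right in rows:
    found = words_found(left, right)
    print(left, "|", found, "|", "yes" if all_words_match(left, right) else "no")
```

Because `words_found` returns the matched words rather than a single flag, you can also score a row by how many leading words matched, which is closer to the 'potato la' case in the question.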

Difflib SequenceMatcher with sentences
