Fuzzy string comparison

Date: 2012-04-30 11:37:20

Tags: python nlp fuzzy-comparison

What I am trying to build is a program that reads in a file and compares each sentence against an original sentence. A sentence that is a perfect match to the original should receive a score of 1, and a sentence that is the total opposite should receive a 0. All other fuzzy sentences should receive a score somewhere between 1 and 0.

I am unsure which operation to use that would allow me to do this in Python 3.

I have included sample text below, in which Text 1 is the original and the following strings are the comparisons.

Sample text:

Text 1: It was a dark and stormy night. I was all alone sitting on a red chair. I was not completely alone as I had three cats.

Text 20: It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines. // Should score high, but not 1

Text 21: It was a murky and tempestuous night. I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines. // Should score lower than Text 20

Text 22: I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines. It was a murky and tempestuous night. // Should score lower than Text 21 but not 0

Text 24: It was a dark and stormy night. I was not alone. I was not sitting on a red chair. I had three cats. // Should score a 0!
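
To make the desired behaviour concrete, here is a rough sketch of the kind of scoring loop I have in mind, using difflib's SequenceMatcher purely as a placeholder for whatever operation turns out to be appropriate (which operation to use is exactly the question):

from difflib import SequenceMatcher

original = ("It was a dark and stormy night. I was all alone sitting on a red chair. "
            "I was not completely alone as I had three cats.")

comparisons = [
    "It was a murky and stormy night. I was all alone sitting on a crimson chair. "
    "I was not completely alone as I had three felines.",
    "It was a dark and stormy night. I was not alone. I was not sitting on a red chair. "
    "I had three cats.",
]

for text in comparisons:
    # ratio() already returns a float between 0 and 1
    score = SequenceMatcher(None, original, text).ratio()
    print(round(score, 3), text)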

4 Answers:

Answer 0 (score: 89):

There is a package called fuzzywuzzy. Install it via pip:

pip install fuzzywuzzy

Simple usage:

>>> from fuzzywuzzy import fuzz
>>> fuzz.ratio("this is a test", "this is a test!")
    96

The package is built on top of difflib. Why not just use that, you ask? Apart from being a bit simpler, it offers a number of different matching methods (such as token-order insensitivity and partial string matching) that make it more powerful in practice. The process.extract functions are especially useful: they find the best matching strings and ratios from a set. From its readme:

Partial Ratio

>>> fuzz.partial_ratio("this is a test", "this is a test!")
    100

Token Sort Ratio

>>> fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
    90
>>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
    100

Token Set Ratio

>>> fuzz.token_sort_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
    84
>>> fuzz.token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
    100

Process

>>> from fuzzywuzzy import process
>>> choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
>>> process.extract("new york jets", choices, limit=2)
    [('New York Jets', 100), ('New York Giants', 78)]
>>> process.extractOne("cowboys", choices)
    ("Dallas Cowboys", 90)
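
As a rough sketch applied to the question's own sentences (assuming fuzzywuzzy is installed; ratios come back as integers from 0 to 100, so divide by 100 to get the 0-to-1 grade asked for):

from fuzzywuzzy import fuzz

original = ("It was a dark and stormy night. I was all alone sitting on a red chair. "
            "I was not completely alone as I had three cats.")
text20 = ("It was a murky and stormy night. I was all alone sitting on a crimson chair. "
          "I was not completely alone as I had three felines.")

# fuzz.ratio returns an int in [0, 100]; scale it down to the 0-1 range
print(fuzz.ratio(original, text20) / 100.0)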

Answer 1 (score: 76):

The standard library has a module (called difflib) that compares strings and returns a score based on their similarity. The SequenceMatcher class should do what you are after.

EDIT: a small example from the Python prompt:

>>> from difflib import SequenceMatcher as SM
>>> s1 = ' It was a dark and stormy night. I was all alone sitting on a red chair. I was not completely alone as I had three cats.'
>>> s2 = ' It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines.'
>>> SM(None, s1, s2).ratio()
0.9112903225806451
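
If you only need the best-matching candidates above a cutoff rather than a score for every pair, difflib also provides get_close_matches; a small self-contained sketch:

from difflib import get_close_matches

original = "It was a dark and stormy night."
candidates = ["It was a murky and stormy night.",
              "I had three cats.",
              "Something else entirely."]

# Returns the candidates whose similarity ratio to `original` clears the
# cutoff, best match first.
print(get_close_matches(original, candidates, n=2, cutoff=0.6))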

HTH!

Answer 2 (score: 13):

fuzzyset is much faster than fuzzywuzzy (difflib) for both indexing and searching.

from fuzzyset import FuzzySet
corpus = """It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines
    It was a murky and tempestuous night. I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines
    I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines. It was a murky and tempestuous night.
    It was a dark and stormy night. I was not alone. I was not sitting on a red chair. I had three cats."""
corpus = [line.lstrip() for line in corpus.split("\n")]
fs = FuzzySet(corpus)
query = "It was a dark and stormy night. I was all alone sitting on a red chair. I was not completely alone as I had three cats."
fs.get(query)
# [(0.873015873015873, 'It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines')]

Warning: Be careful not to mix unicode and bytes in your fuzzyset.

Answer 3 (score: 1):

This task is called Paraphrase Identification, and it is an active area of natural language processing research. I have linked several state-of-the-art papers, for many of which you can find open-source code on GitHub.

Note that all of the existing answers assume some string/surface similarity between the two sentences, whereas in reality two sentences with little string similarity can still be semantically similar.

If you are interested in that kind of similarity, you can use Skip-Thoughts. Install the software following the GitHub instructions, then go to the paraphrase-detection section of the readme:

import skipthoughts
model = skipthoughts.load_model()
vectors = skipthoughts.encode(model, X_sentences)

This converts your sentences (X_sentences) into vectors. You can then find the similarity of two vectors with:

import scipy.spatial.distance
similarity = 1 - scipy.spatial.distance.cosine(vectors[0], vectors[1])

Here we assume vectors[0] and vectors[1] are the vectors corresponding to X_sentences[0] and X_sentences[1], the two sentences whose score you want to find.

There are other models for converting sentences into vectors, which you can find here.

Once the sentences have been converted into vectors, the similarity is simply the cosine similarity between those vectors.
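
For illustration, here is what that cosine-similarity step looks like on made-up toy vectors (any sentence-encoding model could have produced them):

import numpy as np
from scipy.spatial.distance import cosine

# Stand-in embedding vectors for two encoded sentences
v1 = np.array([0.2, 0.1, 0.7])
v2 = np.array([0.25, 0.05, 0.68])

# scipy's cosine() is a distance, so similarity = 1 - distance
print(1 - cosine(v1, v2))  # close to 1.0 for near-parallel vectors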