Question

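I want to find the unique strings in a list of strings, where uniqueness is judged by a similarity percentage (in Python). The strings I keep should be substantially different from each other: if two strings differ only slightly, the second one is not interesting to me.

I can loop over the strings and compute their pairwise similarity percentages, but I wonder whether there is a better way to do this.

For example,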

String A: He is going to school. 
String B: He is going to school tomorrow.

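Let's say these two strings are 80% similar.

Similarity: strings that contain the same words in the same order are the most similar. A string is 100% similar to itself.

The definition is a bit vague, but it works for my use case.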

Answer 1

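If you want to measure how similar two sentences are, and you want the score to reflect whether the words appear in exactly the same order, you can use the single-sentence BLEU score.

I would use the sentence_bleu found here: http://www.nltk.org/_modules/nltk/translate/bleu_score.html

You will need to adjust the n-gram weights for short sentences. An example from something I have done in the past: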

from nltk import word_tokenize  # requires the NLTK tokenizer data to be installed
from nltk.translate.bleu_score import sentence_bleu

sentence1 = "He is a dog"
sentence2 = "She is a dog"

reference = word_tokenize(sentence1.lower())
hypothesis = word_tokenize(sentence2.lower())

# BLEU uses 1- to 4-grams by default. For sentences shorter than 4 tokens,
# spread the weight evenly over the n-gram orders that can actually occur.
shortest = min(len(hypothesis), len(reference))
if shortest < 4:
    weights = tuple([1.0 / shortest] * shortest)
else:
    weights = (0.25, 0.25, 0.25, 0.25)

bleu_score = sentence_bleu([reference], hypothesis, weights=weights)

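Note that single-sentence BLEU is quite bad at detecting similar sentences whose words are in a different order, so be careful if that is the case you care about. Other approaches you could try are document similarity, Jaccard similarity, and cosine similarity.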
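As a rough illustration of the Jaccard and cosine alternatives, here is a minimal standard-library sketch. The helper names (tokenize, jaccard_similarity, cosine_similarity, filter_distinct), the regex tokenizer, and the 0.8 threshold are my own assumptions for the example, not part of NLTK or of the question. Both measures ignore word order, so they do not match the asker's "same order" requirement the way BLEU does.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep runs of letters, digits, and apostrophes as tokens.
    return re.findall(r"[a-z0-9']+", text.lower())


def jaccard_similarity(a, b):
    # Shared distinct words divided by all distinct words in either string.
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 1.0  # assumption: two empty strings count as identical
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_similarity(a, b):
    # Cosine of the angle between the two word-count vectors.
    counts_a, counts_b = Counter(tokenize(a)), Counter(tokenize(b))
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_distinct(strings, threshold=0.8, similarity=jaccard_similarity):
    # Greedy pass: keep a string only if it scores below the threshold
    # against every string kept so far. The result depends on input order.
    kept = []
    for s in strings:
        if all(similarity(s, k) < threshold for k in kept):
            kept.append(s)
    return kept


sentences = [
    "He is going to school.",
    "He is going to school tomorrow.",
    "She likes apples.",
]
distinct = filter_distinct(sentences, threshold=0.8)
```

With the question's two example strings, the Jaccard score is 5/6 (five shared words out of six distinct words), so the second string is dropped at a 0.8 threshold. This still compares each string against everything kept so far, so it is a pairwise loop in the worst case; it only makes the similarity measure pluggable.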
