How can I find similar words in a set?

Time: 2016-03-22 00:43:23

Tags: python string list set

word = "work"
word_set = {"word","look","wrap","pork"}

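How can I find the similar words, so that I get "word" and "pork", which each need only one letter changed to become "work"?

I'm wondering whether there is a way to find the difference between a string and the items in a set.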

3 Answers:

Answer 0 (score: 3)

Use difflib.get_close_matches() from the standard library:

import difflib

word = "work"
word_set = {"word","look","wrap","pork"}

difflib.get_close_matches(word, word_set)

Returns:

['word', 'pork']
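As a small sketch of how the result can be tuned: get_close_matches() also accepts n (maximum number of matches, default 3) and cutoff (minimum similarity ratio between 0.0 and 1.0, default 0.6). The cutoff values 0.7 and 0.4 below are arbitrary choices for illustration. Note that the ratio is a similarity score, not a count of changed letters.

```python
import difflib

word = "work"
word_set = {"word", "look", "wrap", "pork"}

# n caps how many matches are returned; cutoff is the minimum similarity
# ratio a candidate must reach. Defaults are n=3 and cutoff=0.6.
strict = difflib.get_close_matches(word, word_set, n=5, cutoff=0.7)
loose = difflib.get_close_matches(word, word_set, n=5, cutoff=0.4)

print(strict)  # only the candidates sharing three of four letters in order
print(loose)   # a lower cutoff also admits "look" and "wrap"
```

Raising the cutoff narrows the result to near-identical words; lowering it widens the net at the price of weaker matches.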

Edit: if needed, difflib.SequenceMatcher.get_opcodes() can be used to approximate an edit distance by counting the opcodes that are not 'equal':

matcher = difflib.SequenceMatcher(b=word)
for test_word in word_set:
    matcher.set_seq1(test_word)
    distance = len([m for m in matcher.get_opcodes() if m[0]!='equal'])
    print(distance, test_word)
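The loop above counts non-equal opcode blocks, so a single 'replace' block that spans two characters still counts as 1. A possible refinement, sketched here under the assumption that a per-character count is wanted, sums the size of each block instead. The helper name alignment_cost is made up for this example. The result is the cost of the alignment SequenceMatcher happens to find, which is an upper bound on the true Levenshtein distance rather than a guaranteed minimum.

```python
import difflib

def alignment_cost(source, target):
    """Count characters touched by the non-equal opcodes turning source into target."""
    matcher = difflib.SequenceMatcher(a=source, b=target, autojunk=False)
    cost = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # A block covering k characters counts as k edits, not one.
            cost += max(i2 - i1, j2 - j1)
    return cost

word = "work"
for test_word in sorted({"word", "look", "wrap", "pork"}):
    print(alignment_cost(test_word, word), test_word)
```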

Answer 1 (score: 0)

You can do the following:

word = "work"
word_set = set(["word","look","wrap","pork"])

for example in word_set:
    if len(example) != len(word):
        continue
    num_chars_out = sum([1 for c1,c2 in zip(example, word) if c1 != c2])
    if num_chars_out == 1:
        print(example)
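The loop above only detects single-letter substitutions between words of equal length. If one inserted or deleted letter should also count as "similar", a check along the following lines could be used. This is a sketch, not part of the original answer: the function name is_one_edit_away and the extra words "works" and "wok" are added purely for illustration.

```python
def is_one_edit_away(a, b):
    """Return True if a becomes b with exactly one substitution, insertion, or deletion."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    # Make a the shorter (or equal-length) string.
    if len(a) > len(b):
        a, b = b, a
    # Skip the common prefix.
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        # Substitution: everything after the mismatch must be identical.
        return a[i + 1:] == b[i + 1:]
    # Insertion: dropping b[i] must leave exactly the rest of a.
    return a[i:] == b[i + 1:]

word = "work"
word_set = {"word", "look", "wrap", "pork", "works", "wok"}
print(sorted(w for w in word_set if is_one_edit_away(w, word)))
```

Because it stops at the first mismatch and compares the remaining tails once, the check runs in time linear in the word length and needs no distance table.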

Answer 2 (score: 0)

I suggest the editdistance Python package, which provides the editdistance.eval function for computing the number of characters that must change to get from the first word to the second. Edit distance is the same as Levenshtein distance, which MattDMo suggested.

In your case, if you want to identify words that are within an edit distance of 1 of each other, you can do the following:

import editdistance as ed

thresh = 1
w1 = "work"
word_set = set(["word","look","wrap","pork"])
neighboring_words = [w2 for w2 in word_set if ed.eval(w1, w2) <= thresh]
print(neighboring_words)

After this, neighboring_words contains "word" and "pork" (in no guaranteed order, because word_set is a set).
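If installing a third-party package is not an option, the same Levenshtein distance can be computed with a short dynamic-programming routine. The following is a minimal pure-Python sketch, not the editdistance package's implementation. The function name levenshtein is chosen here for illustration.

```python
def levenshtein(a, b):
    """Levenshtein distance using a two-row dynamic-programming table."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]

thresh = 1
w1 = "work"
word_set = {"word", "look", "wrap", "pork"}
neighboring_words = [w2 for w2 in word_set if levenshtein(w1, w2) <= thresh]
print(sorted(neighboring_words))
```

For a handful of short words this is fast enough; a compiled package is likely to pay off only on large word lists.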