Question

I am using speech recognition software that gives less-than-ideal results.

For example, session comes back as fashion or mission.

Right now I have a dictionary like this:

matches = { 'session': ['fashion', 'mission'], ... }

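and I loop over all of the words to find a match.

I don't mind false positives, because the application only accepts a limited set of keywords. However, manually entering new words for each keyword is tedious. On top of that, the speech recognizer comes up with new words every time I speak.

I am also running into trouble where the recognizer returns a long word for a shorter one, so the approach above does not work in that case.

So, is there a built-in method in nltk to do this? Or even a better algorithm I could write myself?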

Answer 1

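You may want to look at python-Levenshtein. It is a Python C extension module for computing string distances and similarities.

Something like this silly, inefficient code might work: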

from Levenshtein import jaro_winkler  # May not be module name

heard_word = "brain"
possible_words = ["watermelon", "brian"]

# Score every candidate against the heard word and keep the best-scoring one.
word_scores = [jaro_winkler(heard_word, possible) for possible in possible_words]
guessed_word = possible_words[word_scores.index(max(word_scores))]

print('I heard {0} and guessed {1}'.format(heard_word, guessed_word))

Here are the documentation and the (no longer maintained) repo.
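To make the idea of a "string distance" more concrete, here is a minimal pure-Python sketch of Levenshtein edit distance, assuming you just want to see what such a library computes. The names edit_distance and closest_word are made up for this illustration and are not part of python-Levenshtein or nltk. Note that this is plain edit distance, not the Jaro-Winkler similarity used above.

```python
def edit_distance(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def closest_word(heard_word, possible_words):
    """Hypothetical helper: the candidate with the smallest edit distance."""
    return min(possible_words, key=lambda word: edit_distance(heard_word, word))


heard_word = "brain"
possible_words = ["watermelon", "brian"]
print('I heard {0} and guessed {1}'.format(
    heard_word, closest_word(heard_word, possible_words)))
```

Raw edit distance grows with word length, so if short and long words get mixed up it may help to divide the distance by the length of the longer word before comparing.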

Answer 2

You can use fuzzywuzzy, a Python package for fuzzy matching of words and strings.

Install the package:

pip install fuzzywuzzy

Sample code relevant to your question:

from fuzzywuzzy import fuzz

MIN_MATCH_SCORE = 80  # fuzz.ratio returns a score from 0 to 100

heard_word = "brain"
possible_words = ["watermelon", "brian"]

# Keep every candidate whose similarity score reaches the threshold.
guessed_word = [word for word in possible_words if fuzz.ratio(heard_word, word) >= MIN_MATCH_SCORE]

print('I heard {0} and guessed {1}'.format(heard_word, guessed_word))

Here are the documentation and repo for fuzzywuzzy.
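If installing a package is not an option, roughly the same threshold idea can be sketched with the standard library's difflib instead of fuzzywuzzy. This is only a sketch under assumptions: the KEYWORDS list and the guess_keyword helper are hypothetical names for this example, and the cutoff of 0.6 is simply difflib's default, not a tuned value.

```python
from difflib import get_close_matches

# Hypothetical keyword set; the real application would supply its own.
KEYWORDS = ["session", "watermelon", "brian"]


def guess_keyword(heard_word, keywords=KEYWORDS, cutoff=0.6):
    """Return the keyword most similar to heard_word, or None if no
    keyword reaches the similarity cutoff (a ratio between 0 and 1)."""
    matches = get_close_matches(heard_word, keywords, n=1, cutoff=cutoff)
    return matches[0] if matches else None


for heard_word in ["mission", "brain", "xyzzy"]:
    print('I heard {0} and guessed {1}'.format(heard_word, guess_keyword(heard_word)))
```

Because the similarity is a ratio rather than a raw count, it is less sensitive to word length than a plain edit distance. Lowering the cutoff lets more misheard words through at the cost of more false positives, which the question says is acceptable.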

Is there a built-in method in nltk to find words/phrases that closely match a given word?
