Question

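I have a Series-type object to which I need to apply a function that uses bigrams to correct a word when it occurs next to another word. I created a list of bigrams, sorted it by frequency (highest first), and called it fdist.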

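import nltk

bigrams = [b for l in text2 for b in zip(l.split(" ")[:-1], l.split(" ")[1:])]
freq = nltk.FreqDist(bigrams)  # frequency of each bigram
# most_common() yields bigrams from most to least frequent (NLTK 3; keys() is only sorted in NLTK 2)
fdist = [bigram for bigram, _ in freq.most_common()]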
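
For readers without NLTK installed, the same frequency-ranked bigram list can be sketched with the standard library. This is only an illustration under assumptions: `ranked_bigrams` and `sample` are hypothetical names, and `collections.Counter` stands in for `nltk.FreqDist`.

```python
from collections import Counter


def ranked_bigrams(lines):
    """Return the bigrams found in `lines`, most frequent first."""
    counts = Counter(
        pair
        for line in lines
        for pair in zip(line.split()[:-1], line.split()[1:])
    )
    # most_common() sorts by descending count
    return [pair for pair, _ in counts.most_common()]


# Hypothetical sample data, not taken from the question
sample = ["lets go home", "lets go east", "lets go"]
ranked = ranked_bigrams(sample)
print(ranked[0])
```

Here the bigram ("lets", "go") appears three times and the other two once each, so it is ranked first.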

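Next, I created a function that accepts each line (a sentence, i.e. one element of a list) and uses the bigrams to decide whether to correct it further.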

def bigram_corr(line):  # line is one sentence
    words = line.split()  # split the line into words
    # walk over adjacent word pairs: (1, 2), (2, 3), (3, 4), ...
    for word1, word2 in zip(words[:-1], words[1:]):
        for i, j in fdist:  # iterate over bigrams, most frequent first
            # if the second words match and the first word is within edit distance 2,
            # replace it with the first word of the highest-ranked matching bigram
            if (word2 == j) and (jf.levenshtein_distance(word1, i) < 3):
                word1 = i  # replace
                return word1  # return word

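The problem is that only a single word is returned for the whole sentence. For example, "Lts go towards the east" is replaced by just "lets". It looks like the later iterations are not running. The for loop over word1, word2 is meant to work like this: in the first iteration the pair is "Lts go", and "Lts" will eventually be replaced by "lets", because "lets" occurs more frequently with "go".

In the second iteration the pair is "go towards".

In the third iteration it is "towards the", and so on.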
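
The pairing described above can be checked in isolation. A minimal sketch, using the example sentence as input (the variable names are only for illustration):

```python
words = "Lts go towards the east".split()

# zip over two offset slices yields each adjacent pair exactly once
pairs = list(zip(words[:-1], words[1:]))
for word1, word2 in pairs:
    print(word1, word2)
```

Note that the last word ("east") only ever appears as word2, so a loop of this shape never tries to correct it.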

There is a minor error that I cannot find. Please help.

Answer 1

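It sounds like you are doing word1 = i and expecting that to modify the contents of words. That will not happen: the assignment only rebinds the local name word1. If you want to modify words, you have to modify it directly. Use enumerate to keep track of the index of word1.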

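As 2rs2ts points out, you are also returning too early. If you want to end the inner loop once the first good replacement is found, use break instead of return. Then return at the end of the function.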

def bigram_corr(line):  # line is one sentence
    words = line.split()  # split the line into words
    for idx, (word1, word2) in enumerate(zip(words[:-1], words[1:])):
        for i, j in fdist:  # iterate over bigrams, most frequent first
            # if the second words match and the first word is within edit distance 2,
            # replace it with the first word of the highest-ranked matching bigram
            if (word2 == j) and (jf.levenshtein_distance(word1, i) < 3):
                words[idx] = i
                break
    return " ".join(words)
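
To try this approach without any external library, here is a self-contained sketch. It rests on several assumptions: `levenshtein` is a plain dynamic-programming stand-in for `jf.levenshtein_distance` (jf is presumably the jellyfish package, which the question does not confirm), the bigram list is passed in as a parameter instead of being read from a global, and `toy_bigrams` is made-up data.

```python
def levenshtein(a, b):
    """Edit distance via dynamic programming (stand-in for jf.levenshtein_distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def bigram_corr(line, ranked_bigrams, max_distance=2):
    """Replace word1 in place when a ranked bigram matches word2."""
    words = line.split()
    for idx, (word1, word2) in enumerate(zip(words[:-1], words[1:])):
        for first, second in ranked_bigrams:
            if word2 == second and levenshtein(word1, first) <= max_distance:
                words[idx] = first  # write back into the list, not the loop variable
                break               # keep only the highest-ranked match
    return " ".join(words)


# Hypothetical bigram list, already ordered from most to least frequent
toy_bigrams = [("lets", "go"), ("go", "towards"), ("towards", "the"), ("the", "east")]
corrected = bigram_corr("Lts go twards the east", toy_bigrams)
print(corrected)
```

One caveat with this design: a threshold of 2 is permissive for short words. For instance, "to" is at distance 1 from "go", so it would be rewritten whenever the next word matches a bigram that starts with "go".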

Answer 2

The return statement stops the function completely. I think what you want is:

def bigram_corr(line):
    words = line.split()
    words_to_return = []
    for word1, word2 in zip(words[:-1], words[1:]):
        for i, j in fdist:
            if (word2 == j) and (jf.levenshtein_distance(word1, i) < 3):
                words_to_return.append(i)
    return ' '.join(words_to_return)

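This puts each word you have processed into a list, then joins them with spaces and returns the whole string, since you said something about returning the "whole sentence".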

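I'm not sure the semantics of your code are right. I don't have the jf library or whatever it is you are using, so I could not test this code, and it may or may not solve your problem completely. But it should help.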
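
One way to exercise the loop logic without the jf library is to pass the distance function in as a parameter. The sketch below is a variant, not the exact code above: `crude_distance` is a hypothetical stand-in that is not a true Levenshtein distance, and the `distance` parameter is an assumption added for testing. It also keeps words that have no matching bigram, and the final word, both of which a version that only appends on a match would leave out.

```python
def crude_distance(a, b):
    """Hypothetical stand-in metric: length difference plus mismatched positions."""
    return abs(len(a) - len(b)) + sum(x != y for x, y in zip(a, b))


def bigram_corr(line, ranked_bigrams, distance):
    words = line.split()
    corrected = []
    for word1, word2 in zip(words[:-1], words[1:]):
        replacement = word1  # default: keep the word unchanged
        for first, second in ranked_bigrams:
            if word2 == second and distance(word1, first) < 3:
                replacement = first
                break  # stop at the highest-ranked match
        corrected.append(replacement)
    corrected.extend(words[-1:])  # the last word is never word1, so carry it over
    return " ".join(corrected)


# Hypothetical one-entry bigram list
result = bigram_corr("teh cat sat", [("the", "cat")], crude_distance)
print(result)
```

Swapping `crude_distance` for the real edit-distance function should only require changing the argument at the call site.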

Replacing words on the basis of bigram frequency, Python
