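I have been working on tokenizing hashtags with the MaxMatch algorithm, comparing candidates against the word list in NLTK, but I am having trouble debugging my code.
The gist of the algorithm is as follows: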
    function MAXMATCH(sentence, dictionary D) returns word sequence W
        if sentence is empty
            return empty list
        for i ← length(sentence) downto 1
            firstword = first i chars of sentence
            remainder = rest of sentence
            if InDictionary(firstword, D)
                return list(firstword, MaxMatch(remainder, dictionary))
        # no word was found, so make a one-character word
        firstword = first char of sentence
        remainder = rest of sentence
        return list(firstword, MaxMatch(remainder, dictionary))
Below is my Python implementation. I inserted some print statements here and there to try to debug it.
    from nltk.corpus import words # words is a Python list
    wordlist = set(words.words())
    lst = []
    def max_match(hashtag, wordlist):
        if not hashtag:
            return None
        for i in range(len(hashtag)-1, -1, -1):
            first_word = (hashtag[0:i+1])
            print "Firstword: " + first_word
            remainder = hashtag[i+1:len(hashtag)]
            print "Remainder: " + remainder
            if first_word in wordlist:
                print "Found: " + first_word
                lst.append(first_word)
                print lst
                max_match(remainder, wordlist)
        # if no word is found, make one-character word
        first_word = hashtag[0]
        remainder = hashtag[1:len(hashtag)]
        lst.append(first_word)
        max_match(remainder, wordlist)
        return lst
    print max_match('labourvictory', wordlist)
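The last line, print max_match('labourvictory', wordlist), should return the list ['labour', 'victory']. I expected the function to exit at the if not hashtag: return None part, but for reasons I don't understand it keeps going and all hell breaks loose.
What am I doing wrong here?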
Answer 0 (score: 0)
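In recursive functions, the most common bug is failing to return a value at the right point. The problem in your code is that when you find a word in the dictionary, you call max_match on the remainder but do not return. The for loop therefore carries on trying shorter prefixes, and the one-character fallback after the loop runs as well. I modified your code to follow the pseudocode you gave: the base case returns an empty list, and each branch returns the word it found followed by the result of the recursive call.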
    def max_match(hashtag, wordlist):
        if not hashtag:
            return []
        for i in range(len(hashtag)-1, -1, -1):
            first_word = (hashtag[0:i+1])
            remainder = hashtag[i+1:len(hashtag)]
            if first_word in wordlist:
                return [first_word] + max_match(remainder, wordlist)
        # if no word is found, make one-character word
        first_word = hashtag[0]
        remainder = hashtag[1:len(hashtag)]
        return [first_word] + max_match(remainder, wordlist)
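
To check the corrected logic without downloading the NLTK corpus, here is a small self-contained sketch. The set toy_words is a hypothetical stand-in for set(words.words()), and the function is the same logic as above with the slicing tidied up; it is meant as an illustration rather than a definitive implementation.

```python
def max_match(hashtag, wordlist):
    """Greedy longest-prefix segmentation (MaxMatch)."""
    if not hashtag:
        return []
    # Try the longest prefix first, then progressively shorter ones.
    for i in range(len(hashtag), 0, -1):
        first_word, remainder = hashtag[:i], hashtag[i:]
        if first_word in wordlist:
            # Returning here stops the loop from trying shorter prefixes.
            return [first_word] + max_match(remainder, wordlist)
    # No prefix is a known word: emit a single character and move on.
    return [hashtag[0]] + max_match(hashtag[1:], wordlist)


# Hypothetical toy dictionary, standing in for the NLTK word list.
toy_words = {"labour", "victory", "lab", "our", "victor"}

print(max_match("labourvictory", toy_words))   # ['labour', 'victory']
print(max_match("labourxvictory", toy_words))  # ['labour', 'x', 'victory']
```

Because every call builds and returns its own list, no module-level lst is needed, and repeated calls do not accumulate results from earlier ones. Note that the recursion goes one level deeper per emitted token, so a very long input made mostly of unknown characters could hit Python's recursion limit.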