I looked up all the suggestions, and everyone says to break the string into tokens with a split function. All of that is already done, but the same error still keeps coming back.
for r in words:
    if r not in stop_words:
        processed_txt += ps.stem(r) + " "
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(processed_txt)
#print(tokens)
dictionary = corpora.Dictionary(tokens)
#corpus = [dictionary.doc2bow(text) for text in tokens]
print(dictionary)
So now it gives the following error:
    raise TypeError("doc2bow expects an array of unicode tokens on input, not a single string")
TypeError: doc2bow expects an array of unicode tokens on input, not a single string
And the output of the tokens variable looks like this:
['becom', 'effect', 'willingli', 'without', 'need', 'obtain', 'knowledg', 'other', 'obtain', 'acquir', 'must', 'testamentari', 'claim', 'ownership', 'task', 'establish', 'endow', 'recept', 'willing', 'willsend', 'anoth', 'given', 'efficaci', 'presuppos']
Please help.