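Below is part of my text-analysis code, which prints the terms. I would like to know what condition I should put in the for loop to collapse words that share the same root. For example, expanse, expand, and expansion should all become expand.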
The WordNet Lemmatizer is not actually doing this job, so I end up with many redundant words from the same root that I do not need for the analysis.
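To make the desired behaviour concrete, here is a minimal sketch of the collapsing step. It assumes a hand-written lookup table: `ROOT_MAP` and `collapse_roots` are hypothetical names that are not part of my code, and the table entries are just the example words from above.

```python
# Hypothetical lookup table: each variant maps to the root it should become.
# The entries are illustrative only, taken from the example in the question.
ROOT_MAP = {
    "expanse": "expand",
    "expansion": "expand",
}


def collapse_roots(docs, root_map=ROOT_MAP):
    """Return new token lists with each token replaced by its root,
    keeping only the first token seen for each root within a document."""
    collapsed = []
    for doc in docs:
        seen = set()
        out = []
        for tok in doc:
            root = root_map.get(tok, tok)  # unknown tokens pass through unchanged
            if root not in seen:           # drop later tokens that share a root
                seen.add(root)
                out.append(root)
        collapsed.append(out)
    return collapsed


sample = [["expanse", "expand", "expansion", "universe"]]
print(collapse_roots(sample))
# → [['expand', 'universe']]
```

A stemmer such as NLTK's `PorterStemmer().stem` could presumably stand in for the table lookup, but stems are not always dictionary words, and related forms are not guaranteed to reduce to the same stem, so the table may still be needed for cases like these.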
# stoplist and stopped_tokens are defined earlier in my code (not shown here)
terms_list = [[tok for tok in doc.split() if tok not in stoplist] for doc in stopped_tokens]
print(terms_list)
print(len(terms_list))
count = 0
for doc in terms_list:
    # iterate over a copy so that removing items does not skip words
    for word in list(doc):
        print(word)
        if word == "|>" or word == "_" or word == "-" or word == "#":
            terms_list[count].remove(word)
        if word == "?":
            terms_list[count].remove(word)
        if word == "...":
            terms_list[count].remove(word)
        if word == "_/":
            terms_list[count].remove(word)
        if word == "i" or word == "a":
            terms_list[count].remove(word)
        if word == "the" or word == "but" or word == "if" or word == "it":
            terms_list[count].remove(word)
    count = count + 1
print(terms_list)
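An alternative to calling `remove` inside the loop is to build new lists that leave out the unwanted tokens. This is only a sketch: `UNWANTED` and `drop_unwanted` are hypothetical names, and the set simply collects the tokens tested in the conditions above.

```python
# Tokens to drop, collected from the conditions in the loop above.
UNWANTED = {"|>", "_", "-", "#", "?", "...", "_/", "i", "a", "the", "but", "if", "it"}


def drop_unwanted(docs, unwanted=UNWANTED):
    """Return new token lists without the unwanted tokens.
    The input lists are left untouched."""
    return [[tok for tok in doc if tok not in unwanted] for doc in docs]


sample = [["the", "expanse", "-", "-", "is", "...", "vast"], ["?", "i", "expand"]]
print(drop_unwanted(sample))
# → [['expanse', 'is', 'vast'], ['expand']]
```

Because nothing is removed from a list while it is being iterated, repeated tokens such as the two "-" entries are all dropped, and no counter variable is needed.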