Question

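I am including part of my text-analysis code, which prints the terms. I would like to know what condition to put in the for loop to collapse words that share the same root. For example, expanse, expand, and expansion should all become expand.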

The WordNet Lemmatizer is not actually doing this job, so I end up with many unnecessary words from the same root that I do not need in the analysis.

terms_list = [[tok for tok in doc.split() if tok not in stoplist] for doc in stopped_tokens]
print(terms_list)
print(len(terms_list))

# Tokens to drop: punctuation debris and a few very common words.
unwanted = {"|>", "_", "-", "#", "?", "...", "_/", "i", "a", "the", "but", "if", "it"}

# Rebuild each document's list instead of calling remove() while iterating over it,
# which skips elements; this also drops every occurrence, not just the first.
terms_list = [[word for word in doc if word not in unwanted] for doc in terms_list]


print (terms_list)
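
To make the goal concrete (expanse, expand, and expansion all becoming expand), here is a rough sketch of one possible heuristic using only the standard library. It is not a stemmer or a lemmatizer: it simply groups words that share their first few letters and replaces each with the shortest word in its group. The function name `collapse_by_prefix` and the `prefix_len=5` threshold are assumptions made up for illustration, and the heuristic will wrongly merge unrelated words that happen to share a prefix.

```python
from collections import defaultdict


def collapse_by_prefix(docs, prefix_len=5):
    """Replace each word with the shortest word sharing its first `prefix_len` letters.

    Hypothetical helper: a crude prefix heuristic, not a real stemmer.
    """
    # Collect all words (long enough to have a full prefix) under their prefix.
    groups = defaultdict(set)
    for doc in docs:
        for word in doc:
            if len(word) >= prefix_len:
                groups[word[:prefix_len]].add(word)

    # Representative of each group: shortest word, ties broken alphabetically.
    representative = {
        prefix: min(words, key=lambda w: (len(w), w))
        for prefix, words in groups.items()
    }

    # Words shorter than prefix_len never match a key, so they pass through unchanged.
    return [
        [representative.get(word[:prefix_len], word) for word in doc]
        for doc in docs
    ]


sample = [["expanse", "expand", "budget"], ["expansion", "cat"]]
collapsed = collapse_by_prefix(sample)
print(collapsed)
```

On this sample the three expan- words all map to expand, while budget and cat are left alone. A shorter `prefix_len` merges more aggressively and a longer one less so; the right value depends on the vocabulary.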

Check whether words come from the same root

0 answers: