How to perform lemmatization with the NLTK package

Time: 2020-07-11 10:44:49

Tags: python nltk wordnet lemmatization

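I have a Python script that must perform lemmatization on a given input (a DataFrame).

The problem is that when I print the most common words in the DataFrame, I get the following result.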

Most common words:
[('**word**', 141), ('twitter', 47), ('pic', 46), ('**Portugal**', 37), ('**words**', 28), ('protest', 19), ('**Portuguese**', 19), ('country', 18), ('spread', 17), ('people', 15)]

The expected result is:

Most common words:
[('**word**', 141), ('twitter', 47), ('pic', 46), ('**Portugal**', 37), ('protest', 19), ('country', 18), ('spread', 17), ('people', 15)]

Code:

import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem.wordnet import WordNetLemmatizer


def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)


# Punctuation and digits to filter out (raw string so the backslash stays literal).
ppt = r'''...!@#$%^&*()....{}’‘ “”  “[]|._-`/?:;"'\,~12345678876543'''
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    # Note: `w not in ppt` is a substring test against the whole ppt string.
    return " ".join(lemmatizer.lemmatize(w, get_wordnet_pos(w))
                    for w in w_tokenizer.tokenize(text) if w not in ppt)
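
The input data and the counting code are not shown, so the cause cannot be pinned down from the post alone. Two things are worth checking, though. First, `WhitespaceTokenizer` leaves punctuation attached to tokens (for example `words,`), and `w not in ppt` only drops tokens that happen to be substrings of `ppt`. Second, lemmatization reduces inflected forms to a dictionary form, and `Portuguese` and `Portugal` are distinct lemmas, so merging them would need an explicit mapping. The sketch below illustrates one possible approach under those assumptions. The names `normalize_tokens`, `most_common_lemmas`, `lemmatize`, and `aliases` are hypothetical, and the default lemmatizer is an identity placeholder so the example runs without the WordNet data.

```python
import string
from collections import Counter

# Characters stripped from both ends of each token. The typographic quotes
# are added because the original `ppt` string contains them.
PUNCT = string.punctuation + "’‘“”"


def normalize_tokens(text):
    """Split on whitespace, strip surrounding punctuation, drop empty and all-digit tokens."""
    tokens = []
    for raw in text.split():
        token = raw.strip(PUNCT)
        if token and not token.isdigit():
            tokens.append(token)
    return tokens


def most_common_lemmas(texts, lemmatize=lambda w: w, aliases=None, n=10):
    """Count lemmas across an iterable of strings.

    `lemmatize` is any callable str -> str (identity by default, an assumption
    made so this sketch runs without WordNet). `aliases` is an optional
    hand-written mapping for forms a lemmatizer will not merge on its own.
    """
    aliases = aliases or {}
    counts = Counter()
    for text in texts:
        for token in normalize_tokens(text):
            lemma = lemmatize(token)
            counts[aliases.get(lemma, lemma)] += 1
    return counts.most_common(n)
```

With NLTK, this might be called as `most_common_lemmas(df["text"], lemmatize=lambda w: lemmatizer.lemmatize(w, get_wordnet_pos(w)), aliases={"Portuguese": "Portugal"})`, where `df["text"]` is an assumed column name. Note that merging forms this way adds their counts together.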
    

0 answers:

No answers