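I have a Python script that must perform lemmatization on a given input (a dataframe).
The problem is that when I print the most common words in the dataframe, I get the following result: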
Most common words:
[('**word**', 141), ('twitter', 47), ('pic', 46), ('**Portugal**', 37), ('**words**', 28), ('protest', 19), ('**Portuguese**', 19), ('country', 18), ('spread', 17), ('people', 15)]
The result should be:
Most common words:
[('**word**', 141), ('twitter', 47), ('pic', 46), ('**Portugal**', 37), ('protest', 19), ('country', 18), ('spread', 17), ('people', 15)]
Code:
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem.wordnet import WordNetLemmatizer

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    # Tag the single word and keep the first letter of its POS tag.
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    # Fall back to noun when the tag is not in the mapping.
    return tag_dict.get(tag, wordnet.NOUN)

ppt = '''...!@#$%^&*()....{}’‘ “” “[]|._-`/?:;"'\,~12345678876543'''
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    # Note: ppt is a string, so "w not in ppt" is a substring test on each token.
    return " ".join([lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in w_tokenizer.tokenize(text) if w not in ppt])
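The step that produces the "Most common words" list is not shown above. As a minimal sketch, assuming the counts come from `collections.Counter` over whitespace-separated tokens, it might look something like the following. The helper name `most_common_words`, its parameters, and the sample texts are hypothetical and not part of the original script.

```python
from collections import Counter
import string

def most_common_words(texts, n=10):
    """Count whitespace-separated tokens across texts (hypothetical helper)."""
    counts = Counter()
    for text in texts:
        for token in text.split():
            # Strip ASCII punctuation from both ends and lowercase the token.
            token = token.strip(string.punctuation).lower()
            if token:
                counts[token] += 1
    return counts.most_common(n)

# Made-up sample input, only to show the call.
sample = ["word words word,", "Word protest"]
print("Most common words:")
print(most_common_words(sample, 3))
```

In this sketch "word" and "words" are still counted as separate entries, because no lemmatization is applied before counting. Presumably each text would be passed through lemmatize_text first so that such forms are merged.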