I have a list of tokens extracted from a PDF source. I am able to preprocess and tokenize the text, but I want to loop through the tokens and convert each token in the list to its lemma from the WordNet corpus. My token list looks like this:
['0000', 'Everyone', 'age', 'remembers', 'Þ', 'rst', 'heard', 'contest', 'I', 'sitting', 'hideout', 'watching', ...]
There are no lemmas for words like 'Everyone', '0000', 'Þ', and many more, which I need to eliminate. But for words like 'age', 'remembers', and 'heard', the token list should look like:
['age', 'remember', 'hear', ...]
I am checking synsets with code like:
syns = wn.synsets("heard")
print(syns[0].lemmas()[0].name())
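For 'heard', this should print hear, since WordNet maps the inflected form back to its base verb and hear is the first lemma of the first matching synset.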
So far, I have created a function clean_text() in Python to do the preprocessing. It looks like:
def clean_text(text):
    # Eliminating punctuations
    text = "".join([word for word in text if word not in string.punctuation])
    # tokenizing
    tokens = re.split("\W+", text)
    # lemmatizing and removing stopwords
    text = [wn.lemmatize(word) for word in tokens if word not in stopwords]
    # converting token list into synset
    syns = [text.lemmas()[0].name() for text in wn.synsets(text)]
    return text
I get the error:
syns = [text.lemmas()[0].name() for text in wn.synsets(text)]
AttributeError: 'list' object has no attribute 'lower'
How do I get the token list for each lemma?
Full code:
import string
import re
from wordcloud import WordCloud
import nltk
from nltk.tokenize.treebank import TreebankWordDetokenizer
from nltk.corpus import wordnet
import PyPDF4
import matplotlib
import numpy as np
from PIL import Image
stopwords = nltk.corpus.stopwords.words('english')
moreStopwords = ['clin97803078874365pallr1indd'] # additional stopwords to be removed manually.
wn = nltk.WordNetLemmatizer()
data = PyPDF4.PdfFileReader(open('ReadyPlayerOne.pdf', 'rb'))
pageData = ''
for page in data.pages:
    pageData += page.extractText()
# print(pageData)
def clean_text(text):
    text = "".join([word for word in text if word not in string.punctuation])
    tokens = re.split("\W+", text)
    text = [wn.lemmatize(word) for word in tokens if word not in stopwords]
    syns = [text.lemmas()[0].name() for text in wordnet.synsets(text)]
    return syns
print(clean_text(pageData))
Answer 0 (score: 0)
You are calling wordnet.synsets(text) with a list of words (check what text is at that point), but you should be calling it with a single word. The preprocessing inside wordnet.synsets tries to apply .lower() to its argument, which causes the error (AttributeError: 'list' object has no attribute 'lower').
Below is a working version of clean_text with this issue fixed:
import string
import re
import nltk
from nltk.corpus import wordnet

stopwords = nltk.corpus.stopwords.words('english')
wn = nltk.WordNetLemmatizer()

def clean_text(text):
    # Eliminating punctuations
    text = "".join([word for word in text if word not in string.punctuation])
    # Tokenizing
    tokens = re.split(r"\W+", text)
    # Lemmatizing and removing stopwords
    text = [wn.lemmatize(word) for word in tokens if word not in stopwords]
    # Looking up synsets one token at a time and collecting their first lemma names
    lemmas = []
    for token in text:
        lemmas += [synset.lemmas()[0].name() for synset in wordnet.synsets(token)]
    return lemmas

text = "The grass was greener."
print(clean_text(text))
which returns:
['grass', 'Grass', 'supergrass', 'eatage', 'pot', 'grass', 'grass', 'grass', 'grass', 'grass', 'denounce', 'green', 'green', 'green', 'green', 'fleeceable']
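Note that this collects the first lemma of every synset for each token, which is why the output above contains several variants of 'grass' and 'green'. If what you actually want is the output shown in the question (['age', 'remember', 'hear', ...]), i.e. one base form per token with tokens unknown to WordNet (such as '0000' or 'Þ') dropped, a minimal sketch (not part of the answer above) using wordnet.morphy could look like this:

from nltk.corpus import wordnet

def lemmatize_known_tokens(tokens):
    """Map each token to its WordNet base form; drop tokens WordNet does not know."""
    lemmas = []
    for token in tokens:
        base = wordnet.morphy(token.lower())  # returns None if the word is not in WordNet
        if base is not None:
            lemmas.append(base)
    return lemmas

print(lemmatize_known_tokens(['0000', 'age', 'remembers', 'heard']))
# expected roughly: ['age', 'remember', 'hear']

Since morphy is tried without a part-of-speech tag here, it checks noun, verb, adjective, and adverb forms in turn, so it lemmatizes verbs like 'remembers' and 'heard' as well, unlike WordNetLemmatizer.lemmatize with its default noun POS.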