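I am trying to write a document-cleaning function that removes stopwords, POS-tags the remaining tokens, and stems them. Here is my code: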
import re
import nltk
from nltk.corpus import stopwords

def cleanDoc(doc):
    stopset = set(stopwords.words('english'))
    stemmer = nltk.PorterStemmer()
    # Remove punctuation, convert to lower case and split into separate words
    tokens = re.findall(r"<a.*?/a>|<[^\>]*>|[\w'@#]+", doc.lower(), flags=re.UNICODE | re.LOCALE)
    # Remove stopwords and tokens of 2 characters or fewer
    clean = [token for token in tokens if token not in stopset and len(token) > 2]
    # POS tagging
    pos = nltk.pos_tag(clean)
    # Stemming
    final = [stemmer.stem(word) for word in pos]
    return final
I get this error:
Traceback (most recent call last):
  File "C:\Users\USer\Desktop\tutorial\main.py", line 38, in <module>
    final = cleanDoc(doc)
  File "C:\Users\USer\Desktop\tutorial\main.py", line 30, in cleanDoc
    final = [stemmer.stem(word) for word in pos]
  File "C:\Python27\lib\site-packages\nltk\stem\porter.py", line 556, in stem
    stem = self.stem_word(word.lower(), 0, len(word) - 1)
AttributeError: 'tuple' object has no attribute 'lower'
Answer 0 (score: 4)
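The problem is on this line:
pos = nltk.pos_tag(clean)
nltk.pos_tag() returns a list of (word, tag) tuples, not a list of strings, so the stemmer receives a tuple and fails when it calls .lower() on it. Use this to get the words: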
pos = nltk.pos_tag(clean)
final = [stemmer.stem(tagged_word[0]) for tagged_word in pos]
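To see why the original comprehension fails, here is a minimal sketch that runs without NLTK or its data files. The list `tagged` is a hand-written stand-in for what `nltk.pos_tag` returns (the tags are illustrative, not real tagger output), and `toy_stem` is a hypothetical placeholder for `stemmer.stem` that, like the Porter stemmer in the traceback, calls `.lower()` on its argument. It is not the Porter algorithm.

```python
# Hand-written stand-in for the output of nltk.pos_tag(clean):
# a list of (word, tag) tuples. The tags are illustrative only.
tagged = [("Running", "VBG"), ("dogs", "NNS"), ("bark", "VBP")]


def toy_stem(word):
    """Hypothetical placeholder for stemmer.stem; NOT the Porter algorithm."""
    word = word.lower()  # the same call that fails in the traceback
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Passing the whole tuple reproduces the error from the question.
try:
    [toy_stem(item) for item in tagged]
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# Indexing element 0 of each tuple passes only the word string.
words_only = [toy_stem(item[0]) for item in tagged]

print(error_message)
print(words_only)
```

In the question's code, the same change means replacing `word` with `tagged_word[0]` (or unpacking the tuple) before calling `stemmer.stem`.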
Answer 1 (score: 2)
nltk.pos_tag returns a list of tuples, not a list of strings. Perhaps you want:
final = [stemmer.stem(word) for word, _ in pos]
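If the tags are needed later, the same tuple unpacking can keep each tag next to its stem instead of discarding it. This is a minimal sketch, assuming `pos` has the (word, tag) shape described above. The helper `stem_tagged`, the sample list, and the use of `str.lower` as a stand-in stemmer are all illustrative so that the snippet runs without NLTK; in the question's code you would pass `stemmer.stem` instead.

```python
def stem_tagged(pos, stem):
    """Return (stem, tag) pairs from a list of (word, tag) tuples."""
    return [(stem(word), tag) for word, tag in pos]


# Illustrative sample in the shape nltk.pos_tag returns; tags are hand-written.
pos = [("Cats", "NNS"), ("Sleep", "VBP")]

# str.lower stands in for stemmer.stem so the example has no dependencies.
stemmed_with_tags = stem_tagged(pos, str.lower)

print(stemmed_with_tags)
```

Unpacking as `word, tag` rather than `word, _` is the only difference from the line above; which form fits depends on whether the tags are used downstream.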