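I have a data frame containing movie reviews. How can I remove the stop words from it?
Here is my data frame. It has two columns: Reviews (the movie review text) and label (pos or neg).
Reviews label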
Bromwell High is a cartoon comedy. It ran at t... pos
Homelessness (or Houselessness as George Carli... pos
Brilliant over-acting by Lesley Ann Warren. Be... pos
This is easily the most underrated film inn th... pos
This is not the typical Mel Brooks film. It wa... pos
This isn't the comedic Robin Williams, nor is ... pos
Yes its an art... to successfully make a slow ... pos
In this "critically acclaimed psychological th... pos
THE NIGHT LISTENER (2006) **1/2 Robin Williams... pos
You know, Robin Williams, God bless him, is co... pos
When I first read Armistead Maupins story I wa... pos
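Answer 0 (score: 0)
You need to tokenize your reviews: use sent_tokenize to split a multi-line review into sentences, then word_tokenize to split each sentence into words. The "if word not in stop_words" check then excludes the stop words.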
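Edit: Thanks, alexis. Changed stopwords.words('english')
to set(stopwords.words('english'))
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
# Needs the NLTK tokenizer and stopwords data to be installed (see nltk.download).
stop_words = set(stopwords.words('english'))  # lowercase fileid; a set gives fast lookups
def remove_stop_words(r):
    # One list of non-stop-word tokens per sentence of the review.
    return [[word for word in word_tokenize(sente) if word not in stop_words]
            for sente in sent_tokenize(r)]
# Lowercase first: the NLTK stop word list is all lowercase.
df['Reviews'] = df['Reviews'].str.lower().apply(remove_stop_words)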
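The edit above swaps the stop word list for a set. A minimal sketch of why that is safe: both containers give the same filtered tokens, but a set answers each membership check by hashing instead of scanning every element. The stop word list here is a tiny illustrative stand-in, not the real NLTK list, and the tokens are hand-written rather than produced by word_tokenize.

```python
# Illustrative stand-in for stopwords.words('english'); NOT the real NLTK list.
stop_list = ["i", "this", "is", "the", "a", "it", "not"]
stop_set = set(stop_list)  # one-time conversion

# Hand-written, already lowercased tokens (an assumption for this sketch).
tokens = ["this", "is", "not", "the", "typical", "mel", "brooks", "film", "."]

# Same result either way; the set avoids a linear scan per token.
kept_with_list = [t for t in tokens if t not in stop_list]
kept_with_set = [t for t in tokens if t not in stop_set]

print(kept_with_set)  # ['typical', 'mel', 'brooks', 'film', '.']
```

With thousands of reviews and a stop word list of well over a hundred entries, the per-token saving adds up, which is presumably the reason for the suggested change.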
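Example (r is not lowercased here, so the capitalized 'When' and 'I' are not matched by the lowercase stop word list):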
>>> r
'When I first read Armistead Maupins story I wa... pos Hello. hi this i skeer.'
>>> [sente for sente in sent_tokenize(r)]
['When I first read Armistead Maupins story I wa... pos Hello.', 'hi this i skeer.']
>>> [[word for word in word_tokenize(sente)] for sente in sent_tokenize(r)]
[['When', 'I', 'first', 'read', 'Armistead', 'Maupins', 'story', 'I', 'wa', '...', 'pos', 'Hello', '.'], ['hi', 'this', 'i', 'skeer', '.']]
>>> [[word for word in word_tokenize(sente) if word not in stop_words] for sente in sent_tokenize(r)]
[['When', 'I', 'first', 'read', 'Armistead', 'Maupins', 'story', 'I', 'wa', '...', 'pos', 'Hello', '.'], ['hi', 'skeer', '.']]
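In the last output above, 'When' and 'I' survive because the comparison is case-sensitive and the stop words are lowercase. If you would rather keep the original casing of the remaining tokens than lowercase the whole column, one option is to lowercase only for the comparison. The sketch below uses the standard library only: STOP_WORDS is a small hypothetical stand-in for set(stopwords.words('english')), and simple_tokenize is a rough regex stand-in for word_tokenize, not the NLTK tokenizer.

```python
import re

# Small illustrative stop word set; a stand-in for
# set(stopwords.words('english')), not the real NLTK list.
STOP_WORDS = {"when", "i", "this", "is", "the", "a"}

def simple_tokenize(text):
    """Rough stand-in for word_tokenize: runs of word characters,
    plus each punctuation mark as its own token."""
    return re.findall(r"\w+|[^\w\s]", text)

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop stop words case-insensitively, keeping the casing of kept tokens."""
    return [tok for tok in simple_tokenize(text) if tok.lower() not in stop_words]

print(remove_stop_words("When I first read this story. hi this i skeer."))
# ['first', 'read', 'story', '.', 'hi', 'skeer', '.']
```

The same tok.lower() comparison can be dropped into the list comprehension from the answer in place of the plain "word not in stop_words" check.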