How to remove stop words from a DataFrame using NLTK

Time: 2017-10-26 17:11:02

Tags: python pandas nltk

I have a DataFrame containing movie reviews. How can I remove the stop words from it?

Here is my DataFrame. It has two columns: Reviews (the movie review text) and label (pos or neg).

                                Reviews                 label  
Bromwell High is a cartoon comedy. It ran at t...   pos  
Homelessness (or Houselessness as George Carli...   pos  
Brilliant over-acting by Lesley Ann Warren. Be...   pos  
This is easily the most underrated film inn th...   pos  
This is not the typical Mel Brooks film. It wa...   pos  
This isn't the comedic Robin Williams, nor is ...   pos  
Yes its an art... to successfully make a slow ...   pos  
In this "critically acclaimed psychological th...   pos  
THE NIGHT LISTENER (2006) **1/2 Robin Williams...   pos  
You know, Robin Williams, God bless him, is co...   pos  
When I first read Armistead Maupins story I wa...   pos  

1 answer:

Answer 0: (score: 0)

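You need to tokenize your reviews: use sent_tokenize to split a multi-sentence review into sentences, then run word_tokenize on each sentence. The "if word not in stop_words" check then filters out the stop words.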

Edit: Thanks, alexis. Changed stopwords.words('english') to set(stopwords.words('english')), so that membership checks are fast.

from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
stop_words = set(stopwords.words('english'))  # set of English stop words; the corpus file name is lowercase

def remove_stop_words(r):
    # one token list per sentence, with stop words dropped
    return [[word for word in word_tokenize(sente) if word not in stop_words] for sente in sent_tokenize(r)]

# lowercase first, because the stop-word list is all lowercase
df['Reviews'] = df['Reviews'].str.lower().apply(remove_stop_words)
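
Note that remove_stop_words returns a list of token lists, one per sentence. If a single cleaned string per review is more convenient, the nested result can be flattened and re-joined. A minimal sketch, where join_tokens is a hypothetical helper name and the sample input is made up for illustration:

```python
from itertools import chain


def join_tokens(sentences):
    """Flatten a list of per-sentence token lists into one space-separated string."""
    return " ".join(chain.from_iterable(sentences))


# Made-up example of what remove_stop_words might return for a two-sentence review
filtered = [["bromwell", "high", "cartoon", "comedy", "."], ["ran", "time", "."]]
cleaned = join_tokens(filtered)
print(cleaned)
```

On the DataFrame this could be applied as a second step, for example df['Reviews'] = df['Reviews'].apply(join_tokens).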

Example:

>>> r
'When I first read Armistead Maupins story I wa...   pos  Hello. hi this i skeer.'
>>> [sente for sente in sent_tokenize(r)]
['When I first read Armistead Maupins story I wa...   pos  Hello.', 'hi this i skeer.']
>>> [[word for word in word_tokenize(sente)] for sente in sent_tokenize(r)]
[['When', 'I', 'first', 'read', 'Armistead', 'Maupins', 'story', 'I', 'wa', '...', 'pos', 'Hello', '.'], ['hi', 'this', 'i', 'skeer', '.']]
>>> [[word for word in word_tokenize(sente) if word not in stop_words] for sente in sent_tokenize(r)]
[['When', 'I', 'first', 'read', 'Armistead', 'Maupins', 'story', 'I', 'wa', '...', 'pos', 'Hello', '.'], ['hi', 'skeer', '.']]
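
In the last output, 'When' and 'I' survive while 'this' and 'i' are removed: the stop-word list is all lowercase and r was not lowercased first. If the original casing should be kept in the result, one option is to lowercase each token only for the comparison. The sketch below illustrates that idea with the standard library only; the tiny STOP_WORDS set and the regex tokenizer are stand-ins (assumptions) for set(stopwords.words('english')) and word_tokenize, so it runs without NLTK data.

```python
import re

# Tiny illustrative stop list (assumption); in practice use
# set(stopwords.words('english')) from NLTK.
STOP_WORDS = {"i", "when", "this", "is", "a", "the"}


def tokenize(text):
    """Rough stand-in for word_tokenize: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def remove_stop_words_ci(text, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is a stop word, keeping the original casing."""
    return [tok for tok in tokenize(text) if tok.lower() not in stop_words]


result = remove_stop_words_ci("When I first read this story.")
print(result)
```

The same "tok.lower() not in stop_words" condition can be dropped into the list comprehension from the answer in place of "word not in stop_words".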