Unable to remove English stop words from a DataFrame

Time: 2017-06-26 00:58:44

Tags: python pandas nltk sentiment-analysis stop-words

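I have been trying to run sentiment analysis on a movie review dataset, and I am stuck: I cannot remove the English stop words from the data. What am I doing wrong?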

from nltk.corpus import stopwords
stop = stopwords.words("English")
list_ = []
for file_ in dataset:
    dataset['Content'] = dataset['Content'].apply(lambda x: [item for item in x.split(',') if item not in stop])
    list_.append(dataset)
dataset = pd.concat(list_, ignore_index=True)

4 Answers:

Answer 0 (score: 0)

Judging by your comments, I don't think you need to loop over dataset at all. (Perhaps dataset contains only a single column named Content.)

You can simply do:

 dataset["Content"] = dataset["Content"].str.split(",").apply(lambda x: [item for item in x if item not in stop])
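For illustration, here is a minimal sketch of that one-liner on a toy frame. The small `stop` set is a hand-written stand-in for the NLTK list (an assumption, so the snippet runs without downloading the corpus), and the sample rows are made up.

```python
import pandas as pd

# Hypothetical stand-in for stopwords.words("english"); a set gives fast lookups
stop = {"i", "am", "the"}

# Made-up sample data, separated by commas with no surrounding spaces
dataset = pd.DataFrame({"Content": ["i,am,the,computer,machine", "i,play,game"]})

# Split each string on commas, then keep only tokens that are not stop words
dataset["Content"] = dataset["Content"].str.split(",").apply(
    lambda tokens: [item for item in tokens if item not in stop]
)

print(dataset["Content"].tolist())
```

The membership check is an exact string match, so a token such as " the" (with a leading space) would not be removed by this version.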

Answer 1 (score: 0)

You are looping over the dataset, but on every pass you append the whole frame instead of using file_. Try:

from nltk.corpus import stopwords
stop = stopwords.words("english")  # lowercase: the corpus file is named "english"
dataset['Cleaned'] = dataset['Content'].apply(lambda x: ','.join([item for item in x.split(',') if item not in stop]))

This returns a Series of comma-joined strings. If you want to flatten it into a single list of words:

flat_list = [item for cleaned in dataset['Cleaned'] for item in cleaned.split(',')]

Hat tip to "Making a flat list out of list of lists in Python".
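To see why the loop in the question never reaches individual rows, a short sketch may help. It assumes a made-up one-column frame; iterating over a DataFrame yields its column labels, so `file_` is just the string 'Content'.

```python
import pandas as pd

# Made-up one-column frame, as assumed in the answers
dataset = pd.DataFrame({"Content": ["i,am,the", "i,play,game"]})

# Iterating over a DataFrame yields column labels, not rows or files
labels = [file_ for file_ in dataset]
print(labels)

# To visit the values one by one, iterate over the column instead
rows = [text for text in dataset["Content"]]
print(rows)
```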

Answer 2 (score: 0)

Try earthy:

>>> from earthy.wordlist import punctuations, stopwords
>>> from earthy.preprocessing import remove_stopwords
>>> result = dataset['Content'].apply(remove_stopwords)

See https://github.com/alvations/earthy/blob/master/FAQ.md#what-else-can-earthy-do

Answer 3 (score: 0)

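Going by the information given so far, I think the code should work. My assumption is that the data has extra spaces around the commas that separate the words. Here is a test run (hope it helps!):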

import pandas as pd
from nltk.corpus import stopwords

# Requires the corpus to be downloaded once: nltk.download('stopwords')
stop = stopwords.words('english')

# DataFrame.append was removed in pandas 2.0, so build both rows up front
dataset = pd.DataFrame({'Content': ['i, am, the, computer, machine', 'i, play, game']})
print(dataset)
list_ = []
# Iterating over a DataFrame yields its column names, so this loop runs once here
for file_ in dataset:
    # strip() removes the space after each comma before the stop-word check
    dataset['Content'] = dataset['Content'].apply(lambda x: [item.strip() for item in x.split(',') if item.strip() not in stop])
    list_.append(dataset)
dataset = pd.concat(list_, ignore_index=True)

print(dataset)

Input, with stop words:

                          Content
0   i, am, the, computer, machine
1                   i, play, game

Output:

                Content
 0  [computer, machine]
 1         [play, game]
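The test run above handles stray spaces. Case is another possible mismatch, since the NLTK English list is lowercase. The sketch below covers both; `remove_stop_words` is a hypothetical helper name, and the small `stop` set is a hand-written stand-in for the NLTK list so the snippet runs without the corpus.

```python
import re

def remove_stop_words(text, stop_words):
    """Split comma-separated text and drop tokens found in stop_words.

    Hypothetical helper: tokens are trimmed and compared in lowercase.
    """
    # Split on commas, swallowing any whitespace around them
    tokens = re.split(r"\s*,\s*", text.strip())
    # Skip empty tokens and anything whose lowercase form is a stop word
    return [tok for tok in tokens if tok and tok.lower() not in stop_words]

# Hand-written stand-in for the NLTK English list (assumption)
stop = {"i", "am", "the"}

cleaned = remove_stop_words("I, am, The, computer, machine", stop)
print(cleaned)
```

With a DataFrame, this could be applied as `dataset['Content'].apply(lambda x: remove_stop_words(x, stop))`.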