Removing stop words in Python

Time: 2018-10-23 04:14:30

Tags: python python-3.x pandas nltk

I am reading a CSV file and extracting the words from it, and I am trying to remove the stop words. Here is my code.

import pandas as pd
from nltk.corpus import stopwords as sw

def loadCsv(fileName):
    df = pd.read_csv(fileName, error_bad_lines=False)
    df.dropna(inplace = True)
    return df

def getWords(dataframe):
    words = []
    for tweet in dataframe['SentimentText'].tolist():
        for word in tweet.split():
            word = word.lower()

        words.append(word)

    return set(words) #Create a set from the words list

def removeStopWords(words):
    for word in words: # iterate over word_list
        if word in sw.words('english'): 
            words.remove(word) # remove word from filtered_word_list if it is a stopword

    return set(words)

df = loadCsv("train.csv")
words = getWords(df)
words = removeStopWords(words)

On this line

if word in sw.words('english'):

I get the following error.

Exception: No description

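Further down the line I will also be trying to remove punctuation, so any pointers on that would be great too. Any help is much appreciated.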

Edit

def removeStopWords(words):
    filtered_word_list = words #make a copy of the words
    for word in words: # iterate over words
        if word in sw.words('english'): 
            filtered_word_list.remove(word) # remove word from filtered_word_list if it is a stopword

    return set(filtered_word_list)

3 answers:

Answer 0 (score: 0)

Change your removeStopWords function to the following:

def getFilteredStopWords(words):
    # Build the stop-word set once; set membership tests are fast
    stop_words = set(sw.words('english'))
    # Keep only the words that are not stop words
    filtered_words = [w for w in words if w not in stop_words]
    return filtered_words

Answer 1 (score: 0)

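Here is a simplified version of the problem, without pandas. I think the issue with the original code is that it modifies the set words while iterating over it. By using a conditional list comprehension, we can test each word, build a new list, and finally convert it to a set as in the original code.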

from nltk.corpus import stopwords as sw

def removeStopWords(words):
    # Load the stop-word list once rather than once per word
    stop_words = set(sw.words('english'))
    return set([w for w in words if w not in stop_words])

sentence = 'this is a very common english sentence with a finite set of words from my imagination'
words = set(sentence.split())
print(removeStopWords(words))
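
The failure mode described in this answer can be reproduced without NLTK. The following is a minimal sketch, assuming a small hand-written stop-word set (`STOP_WORDS` and both function names are placeholders for illustration, not NLTK's list or API). In CPython, removing items from a set while looping over it raises a `RuntimeError`, whereas building a new set leaves the input untouched.

```python
# Hypothetical stand-in for set(sw.words('english')); not NLTK's real list.
STOP_WORDS = {"this", "is", "a", "with", "of", "from", "my"}

def remove_in_place(words):
    """Mutates the set while iterating over it, as in the question."""
    for word in words:
        if word in STOP_WORDS:
            words.remove(word)
    return words

def remove_with_new_set(words):
    """Builds a new set and leaves the input untouched."""
    return {w for w in words if w not in STOP_WORDS}

words = set("this is a very common english sentence".split())

try:
    remove_in_place(set(words))  # pass a copy so `words` stays intact
    in_place_error = None
except RuntimeError as exc:
    in_place_error = str(exc)

filtered = remove_with_new_set(words)
print(in_place_error)
print(sorted(filtered))
```

Note that a plain assignment such as `filtered_word_list = words` only binds a second name to the same set, so removing from one still changes the set being iterated. An actual copy needs `set(words)` or `words.copy()`.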

Answer 2 (score: 0)

from nltk.corpus import stopwords

def remove_stopwords(sentence):
    # Build the stop-word set once for fast membership tests
    stop_words = set(stopwords.words('english'))
    words = sentence.split(' ')
    filtered_words = [w for w in words if w not in stop_words]
    # Rejoin the remaining words into a single string
    return ' '.join(filtered_words)
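
For the punctuation part of the question, one possible approach is to strip punctuation with `str.translate` before splitting the sentence. This is only a sketch under stated assumptions: `clean_sentence` is a hypothetical helper, and `STOP_WORDS` is a small hand-written placeholder that would be swapped for `set(stopwords.words('english'))` in real use.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

# Hypothetical stand-in for NLTK's English stop-word list.
STOP_WORDS = {"this", "is", "a"}

def clean_sentence(sentence, stop_words=STOP_WORDS):
    """Lowercase, drop punctuation, then drop stop words."""
    no_punct = sentence.lower().translate(_PUNCT_TABLE)
    kept = [w for w in no_punct.split() if w not in stop_words]
    return " ".join(kept)

cleaned = clean_sentence("This is, frankly, a very common sentence!")
print(cleaned)
```

One caveat of this design: deleting apostrophes turns a contraction such as "don't" into "dont", which may no longer match an entry in the stop-word list, so filtering stop words before stripping punctuation can be the better order for some data.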