How to remove stop words using nltk or python

Time: 2011-03-30 12:36:26

Tags: python nltk stop-words

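So I have a dataset from which I would like to remove stop words using

stopwords.words('english')

I'm struggling with how to use this in my code to simply take out these words. I already have a list of the words from this dataset; the part I'm struggling with is comparing against this list and removing the stop words. Any help is appreciated.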

14 answers:

Answer 0: (score: 178)

from nltk.corpus import stopwords
# ...
filtered_words = [word for word in word_list if word not in stopwords.words('english')]
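
In the one-liner above, stopwords.words('english') is evaluated again for every word and the result is searched as a list. A minimal sketch of building the collection once as a set instead; the short stop_words set here is a hypothetical stand-in so the snippet runs without the NLTK corpus:

```python
# Stand-in for the NLTK list; with NLTK installed you would use:
#   from nltk.corpus import stopwords
#   stop_words = set(stopwords.words('english'))
stop_words = {"a", "an", "the", "is", "in", "of", "and"}  # hypothetical short list

word_list = ["the", "cat", "is", "in", "the", "garden"]

# The set is built once, and each membership test is a hash lookup.
filtered_words = [word for word in word_list if word not in stop_words]
print(filtered_words)  # ['cat', 'garden']
```

The filtering logic is unchanged; only the place where the stop word collection is built differs.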

Answer 1: (score: 19)

You could also take a set difference, for example:

import nltk
list(set(nltk.regexp_tokenize(sentence, pattern, gaps=True)) - set(nltk.corpus.stopwords.words('english')))
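
One thing to keep in mind with the set approach is that it also drops repeated words and does not keep the original word order. A small sketch of the difference, using a hypothetical stand-in stop word set and a hand-written token list rather than the NLTK tokenizer:

```python
stop_words = {"the", "is", "in", "on"}  # hypothetical stand-in for the NLTK list
tokens = ["the", "cat", "sat", "on", "the", "cat", "mat"]

# Set difference: duplicates collapse and the result order is arbitrary.
unique_words = set(tokens) - stop_words

# Order-preserving filter: keeps repeats and the original sequence.
ordered_words = [t for t in tokens if t not in stop_words]

print(sorted(unique_words))  # ['cat', 'mat', 'sat']
print(ordered_words)         # ['cat', 'sat', 'cat', 'mat']
```

If you only need the vocabulary, the set difference is fine; if you need the text itself, the filter is probably the safer choice.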

Answer 2: (score: 14)

I suppose you have a list of words (word_list) from which you want to remove stop words. You could do something like this:

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))  # build the stop word set once
filtered_word_list = word_list[:]  # make a copy of the word_list
for word in word_list:  # iterate over word_list
  if word in stop_words:
    filtered_word_list.remove(word)  # remove word from filtered_word_list if it is a stopword

Answer 3: (score: 9)

To exclude all kinds of stop words, including the nltk stop words, you could do something like this:

from stop_words import get_stop_words
from nltk.corpus import stopwords

stop_words = list(get_stop_words('en'))         #About 900 stopwords
nltk_words = list(stopwords.words('english')) #About 150 stopwords
stop_words.extend(nltk_words)

output = [w for w in word_list if w not in stop_words]
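
Since the two lists overlap, extending one with the other leaves duplicate entries. A minimal sketch of merging them as a set union instead; package_words and nltk_words are hypothetical stand-ins for the two real lists so that the snippet runs on its own:

```python
# Hypothetical stand-ins for get_stop_words('en') and stopwords.words('english').
package_words = ["a", "the", "and", "about"]
nltk_words = ["the", "and", "is", "ours"]

# Union of both lists: overlapping entries are stored once.
stop_words = set(package_words) | set(nltk_words)

word_list = ["this", "is", "about", "the", "data"]
output = [w for w in word_list if w not in stop_words]
print(len(stop_words))  # 6
print(output)           # ['this', 'data']
```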

Answer 4: (score: 3)

Use the textcleaner library to remove stop words from your data.

See this link: https://yugantm.github.io/textcleaner/documentation.html#remove_stpwrds

Follow these steps to use the library.

pip install textcleaner

After installing:

import textcleaner as tc
data = tc.document(<file_name>) 
#you can also pass list of sentences to the document class constructor.
data.remove_stpwrds() #inplace is set to False by default

Use the code above to remove the stop words.

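Answer 5: (score: 3)

There is a very simple, lightweight Python package for this: stop-words.

First install the package using: pip install stop-words

Then you can remove the words in one line using a list comprehension: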

from stop_words import get_stop_words

filtered_words = [word for word in dataset if word not in get_stop_words('english')]

This package is very lightweight to download (unlike nltk), works for both Python 2 and Python 3, and has stop words for many other languages, such as:

    Arabic
    Bulgarian
    Catalan
    Czech
    Danish
    Dutch
    English
    Finnish
    French
    German
    Hungarian
    Indonesian
    Italian
    Norwegian
    Polish
    Portuguese
    Romanian
    Russian
    Spanish
    Swedish
    Turkish
    Ukrainian

Answer 6: (score: 1)

You can use this function; note that you need to lowercase all the words.

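
A minimal sketch of the lowercasing point, not the answerer's original function: remove_stopwords is a hypothetical name and the short stop word set is a stand-in for the NLTK list, whose entries are lowercase.

```python
def remove_stopwords(words, stop_words):
    """Return the words whose lowercase form is not a stop word."""
    return [w for w in words if w.lower() not in stop_words]

stop_words = {"the", "is", "a"}  # hypothetical stand-in list, all lowercase
print(remove_stopwords(["The", "sky", "IS", "blue"], stop_words))  # ['sky', 'blue']
```

Without the .lower() call, "The" and "IS" would not match the lowercase entries and would stay in the result.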

Answer 7: (score: 1)

Using filter:

from nltk.corpus import stopwords
# ...  
filtered_words = list(filter(lambda word: word not in stopwords.words('english'), word_list))

Answer 8: (score: 1)

If you want the result as a string right away (instead of a list of filtered words), here is my take:

from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english'))
text = ' '.join([word for word in text.split() if word not in STOPWORDS])  # delete stopwords from text
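
Because text.split() only splits on whitespace, a token such as "the," or "The" will not match the lowercase stop word "the". A rough sketch of one way to handle that, comparing on a lowercased, punctuation-stripped key while keeping the original token; strip_stopwords is a hypothetical helper and STOPWORDS here is a short stand-in for the NLTK set:

```python
import string

STOPWORDS = {"this", "is", "a", "the", "of"}  # hypothetical stand-in list

def strip_stopwords(text):
    """Drop stop words from text, comparing on a lowercased, punctuation-stripped form."""
    kept = []
    for word in text.split():
        key = word.strip(string.punctuation).lower()
        if key not in STOPWORDS:
            kept.append(word)
    return ' '.join(kept)

print(strip_stopwords("This is a test, the first of many."))  # test, first many.
```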

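Answer 9: (score: 1)

Although this question is a bit old, here is a new library worth mentioning that can do additional tasks.

In some cases, you don't want only to remove stop words. Rather, you may want to find the stop words in the text data and store them in a list, so that you can find the noise in the data and make it more interactive.

The library is called 'textfeatures'. You can use it as follows: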

! pip install textfeatures
import textfeatures as tf
import pandas as pd

For example, suppose you have the following set of strings:

texts = [
    "blue car and blue window",
    "black crow in the window",
    "i see my reflection in the window"]

df = pd.DataFrame(texts) # Convert to a dataframe
df.columns = ['text'] # give a name to the column
df

Now, call the stopwords() function and pass the parameters you want:

tf.stopwords(df,"text","stopwords") # extract stop words
df[["text","stopwords"]].head() # show the text and the extracted stop words

The result is:

    text                                 stopwords
0   blue car and blue window             [and]
1   black crow in the window             [in, the]
2   i see my reflection in the window    [i, my, in, the]

As you can see, the last column holds the stop words contained in that document (record).
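
The same idea can be sketched in plain Python without any extra library. This is only an illustration of the technique, not how textfeatures is implemented, and the stop word set is a hypothetical stand-in:

```python
stop_words = {"and", "in", "the", "i", "my"}  # hypothetical stand-in list

texts = [
    "blue car and blue window",
    "black crow in the window",
    "i see my reflection in the window",
]

# For each text, collect the stop words it contains, in order of appearance.
found = [[w for w in text.split() if w in stop_words] for text in texts]
for text, words in zip(texts, found):
    print(text, '->', words)
```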

Answer 10: (score: 0)

print("Enter the string from which you want to remove stop words:")
userstring = input().split(" ")
stop_list = ["a", "an", "the", "in"]   # named stop_list to avoid shadowing the built-in list
another_list = []
for x in userstring:
    if x not in stop_list:             # keep only the words that are not stop words
        another_list.append(x)         # it is also possible to use .remove
for x in another_list:
    print(x, end=' ')

# 2) if you prefer to use .remove, the code will be like this
print("Enter the string from which you want to remove stop words:")
userstring = input().split(" ")
stop_list = ["a", "an", "the", "in"]
for x in userstring[:]:                # loop over a copy, because removing while iterating skips items
    if x in stop_list:
        userstring.remove(x)
for x in userstring:
    print(x, end=' ')
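
Be careful with the .remove variant: removing items from a list while looping over that same list can skip elements. A small sketch of the effect with a fixed word list instead of input(); the variable names are only for illustration:

```python
stop_list = ["a", "an", "the", "in"]

# Removing from the list being looped over: after "a" is removed, "the"
# slides into its position and the loop never visits it.
words = ["a", "the", "cat"]
for x in words:
    if x in stop_list:
        words.remove(x)
skipped = words

# Looping over a copy (words[:]) visits every original element.
words = ["a", "the", "cat"]
for x in words[:]:
    if x in stop_list:
        words.remove(x)
cleaned = words

print(skipped)  # ['the', 'cat']
print(cleaned)  # ['cat']
```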

Answer 11: (score: 0)

If your data is stored in a Pandas DataFrame, you can use remove_stopwords from texthero, which uses the NLTK stop word list by default.

import pandas as pd
import texthero as hero
df['text_without_stopwords'] = hero.remove_stopwords(df['text'])

Answer 12: (score: 0)

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(example_sent)

# One-line version using a list comprehension
filtered_sentence = [w for w in word_tokens if w not in stop_words]

# Equivalent explicit loop
filtered_sentence = []

for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)

Answer 13: (score: 0)

I will show you some examples. First, I extract the text data from the data frame (twitter_df) for further processing, as follows:

     from nltk.tokenize import word_tokenize
     tweetText = twitter_df['text']

Then, to tokenize, I use the following method:

     from nltk.tokenize import word_tokenize
     tweetText = tweetText.apply(word_tokenize)

Then, to remove the stop words:

     import nltk
     from nltk.corpus import stopwords
     nltk.download('stopwords')

     stop_words = set(stopwords.words('english'))
     tweetText = tweetText.apply(lambda x:[word for word in x if word not in stop_words])
     tweetText.head()

I think this will help you.