Remove specific words from an NLTK frequency distribution, beyond stopwords

Time: 2015-08-05 08:14:38

Tags: python list nltk

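I have a simple sentence. I want to remove prepositions and words like A and IT from the list. I looked through the Natural Language Toolkit (NLTK) documentation, but I couldn't find anything. Can someone show me how? Here is my code: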

import nltk
from nltk.tokenize import RegexpTokenizer
test = "Hello, this is my sentence. It is a very basic sentence with not much information in it"
test = test.upper()
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(test)
fdist = nltk.FreqDist(tokens)
common = fdist.most_common(100)

2 answers:

Answer 0: (score: 4)

Perhaps stopwords is the solution you are looking for?

You can easily filter them out of the tokenized text:

from nltk.probability import FreqDist
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

en_stopws = stopwords.words('english')  # this loads the default stopwords list for English
en_stopws.append('spam')  # add any words you don't like to the list

test = "Hello, this is my sentence. It is a very basic sentence with not much information in it but a lot of spam"
test = test.lower()  # I changed it to lower(), since stopwords are all lower case
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(test)
tokens = [token for token in tokens if token not in en_stopws]  # filter stopwords
fdist = FreqDist(tokens)
common = fdist.most_common(100)

I couldn't find a good way to delete entries from a FreqDist; if you find one, let me know.
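One possibility for deleting entries in place, relying on FreqDist being a subclass of collections.Counter in NLTK 3, is to remove keys as with any dictionary. This is only a sketch: the `unwanted` set is a hypothetical hand-written list, and the fallback to `Counter` is an assumption added so the snippet also runs where NLTK is not installed.

```python
from collections import Counter

try:
    from nltk import FreqDist  # FreqDist subclasses collections.Counter
except ImportError:
    FreqDist = Counter  # stand-in so the sketch runs without NLTK

tokens = "hello this is my sentence it is a very basic sentence".split()
fdist = FreqDist(tokens)

# Hypothetical set of words to drop; with NLTK this could be
# stopwords.words('english') plus any extra words of your own.
unwanted = {"this", "is", "my", "it", "a", "very"}

# Iterate over a copy of the keys, because deleting changes the mapping.
for word in list(fdist):
    if word in unwanted:
        del fdist[word]

remaining = dict(fdist)
```

Filtering the tokens before counting, as in the code above, is usually simpler; in-place deletion is mainly useful when the distribution has already been built.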

Answer 1: (score: 2)

Essentially, nltk.probability.FreqDist is a collections.Counter object (https://github.com/nltk/nltk/blob/develop/nltk/probability.py#L61). Given a dictionary object, there are several ways to filter it:

1. Read into a FreqDist and filter it with a lambda function

>>> import nltk
>>> text = "Hello, this is my sentence. It is a very basic sentence with not much information in it"
>>> tokenized_text = nltk.word_tokenize(text)
>>> stopwords = nltk.corpus.stopwords.words('english')
>>> word_freq = nltk.FreqDist(tokenized_text)
>>> dict_filter = lambda word_freq, stopwords: dict( (word,word_freq[word]) for word in word_freq if word not in stopwords )
>>> filtered_word_freq = dict_filter(word_freq, stopwords)
>>> len(word_freq)
17
>>> len(filtered_word_freq)
8
>>> word_freq
FreqDist({'sentence': 2, 'is': 2, 'a': 1, 'information': 1, 'this': 1, 'with': 1, 'in': 1, ',': 1, '.': 1, 'very': 1, ...})
>>> filtered_word_freq
{'information': 1, 'sentence': 2, ',': 1, '.': 1, 'much': 1, 'basic': 1, 'It': 1, 'Hello': 1}

2. Read into a FreqDist and filter it with a dictionary comprehension

>>> word_freq
FreqDist({'sentence': 2, 'is': 2, 'a': 1, 'information': 1, 'this': 1, 'with': 1, 'in': 1, ',': 1, '.': 1, 'very': 1, ...})
>>> filtered_word_freq = dict((word, freq) for word, freq in word_freq.items() if word not in stopwords)
>>> filtered_word_freq
{'information': 1, 'sentence': 2, ',': 1, '.': 1, 'much': 1, 'basic': 1, 'It': 1, 'Hello': 1}

3. Filter the words before reading them into a FreqDist

Remove the stopwords from the token list first, then build the FreqDist from the tokens that remain, as the first answer does.
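For the third approach, filtering the words before they are read into the FreqDist, a minimal sketch might look like the following. It makes a few assumptions that are not part of the answer: the `stopwords` set is a small hand-written subset rather than NLTK's English list, `re.findall(r"\w+", ...)` stands in for `RegexpTokenizer(r'\w+')`, and `Counter` is used as a fallback if NLTK is not installed.

```python
import re
from collections import Counter

try:
    from nltk import FreqDist  # FreqDist subclasses collections.Counter
except ImportError:
    FreqDist = Counter  # stand-in so the sketch runs without NLTK

text = ("Hello, this is my sentence. It is a very basic sentence "
        "with not much information in it")

# Hypothetical stopword subset; with NLTK data installed you would use
# nltk.corpus.stopwords.words('english') instead.
stopwords = {"this", "is", "my", "it", "a", "very", "with", "not", "in"}

# Lowercase first, because the stopword entries are lowercase.
tokens = re.findall(r"\w+", text.lower())

# Filter while feeding the tokens in, so stopwords are never counted.
filtered_word_freq = FreqDist(t for t in tokens if t not in stopwords)
top = filtered_word_freq.most_common(1)
```

Because the filtering happens before counting, the result is still a FreqDist (or Counter), so methods such as `most_common` remain available, unlike the plain dict produced by the other two approaches.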