Getting rid of stop words in a text mining exercise

Time: 2017-01-18 15:06:01

Tags: python

I have the following code from this tutorial: http://blog.christianperone.com/2011/09/machine-learning-text-feature-extraction-tf-idf-part-i/

train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.",
        "We can see the shining sun, the bright sun.")

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

print(vectorizer)
#CountVectorizer(analyzer__min_n=1, analyzer__stop_words=set(['all']))
vectorizer.fit_transform(train_set)
print(vectorizer.vocabulary_)  # the fitted vocabulary is stored in vocabulary_ (trailing underscore)

smatrix = vectorizer.transform(test_set)
print(smatrix.todense())

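This gives me a matrix of the words used in the different sentences. The approach works well, but I would like to get rid of some stop words.

So I tried: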

CountVectorizer(analyzer__min_n=1, analyzer__stop_words=set(['is', 'the']))

However, this gives me the following error:

Traceback (most recent call last):
  File "C:/Users/Marc/PycharmProjects/clustering/testing.py", line 16, in <module>
    CountVectorizer(analyzer__min_n=1, analyzer__stop_words=set(['is', 'the']))
TypeError: __init__() got an unexpected keyword argument 'analyzer__min_n'

Any idea what is going wrong?
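
A minimal sketch of one likely fix, assuming a recent scikit-learn release. The `analyzer__min_n=...` text in the tutorial appears to be the printed representation of the vectorizer from a much older version of the library, not a set of valid constructor arguments. Current versions accept the stop-word list directly through the `stop_words` parameter of `CountVectorizer`. The list `['is', 'the']` is the one from the attempt above.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.",
            "We can see the shining sun, the bright sun.")

# Pass the custom stop words directly; tokens are lowercased by default,
# so 'The' and 'the' are both filtered out.
vectorizer = CountVectorizer(stop_words=['is', 'the'])
vectorizer.fit(train_set)

# Words kept in the vocabulary after stop-word removal.
print(sorted(vectorizer.vocabulary_))
# ['blue', 'bright', 'sky', 'sun']

# Count matrix for the test sentences; columns follow the sorted vocabulary.
smatrix = vectorizer.transform(test_set)
print(smatrix.toarray())
# [[0 1 1 1]
#  [0 1 0 2]]
```

Passing `stop_words='english'` instead selects scikit-learn's built-in English stop-word list, which may be enough if no custom list is needed.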

0 answers:

No answers