How to pass stop words from a dataframe column

Asked: 2017-12-12 14:06:34

Tags: pandas scikit-learn nlp

I get an error when I pass a dataframe column directly as the stop words. How can I fix this?

    stop_words_corpus = pd.DataFrame(word_dictionary_corpus.Word.unique(), columns=feature_names)

    cv = CountVectorizer(max_features=200, analyzer='word', stop_words=stop_words_corpus)
    cv_txt = cv.fit_transform(data.pop('Clean_addr'))

**Updated Error**

    ~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
        867 
        868         vocabulary, X = self._count_vocab(raw_documents,
    --> 869                                           self.fixed_vocabulary_)
        870 
        871         if self.binary:

    ~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in _count_vocab(self, raw_documents, fixed_vocab)
        783             vocabulary.default_factory = vocabulary.__len__
        784 
    --> 785         analyze = self.build_analyzer()
        786         j_indices = []
        787         indptr = _make_int_array()

    ~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in build_analyzer(self)
        260 
        261         elif self.analyzer == 'word':
    --> 262             stop_words = self.get_stop_words()
        263             tokenize = self.build_tokenizer()
        264 

I fixed that error, but I am still having the issue.
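A likely root cause (a hedged guess, since the posted traceback is truncated before the final error line): scikit-learn internally compares the `stop_words` argument against the string `"english"`, and on a pandas `DataFrame` that comparison is elementwise, so using the result in an `if` raises the classic "truth value is ambiguous" error. A tiny illustration with invented data:

```python
import pandas as pd

stop_words_corpus = pd.DataFrame({'Word': ['the', 'of']})  # invented data

# scikit-learn does roughly: if stop_words == "english": ...
# On a DataFrame that comparison is elementwise, so the `if` blows up.
try:
    if stop_words_corpus == 'english':
        pass
    error = None
except ValueError as exc:
    error = str(exc)

print(error)
```

This is why both answers below convert the dataframe column into a flat array of strings before handing it to `CountVectorizer`.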

2 Answers:

Answer 0 (score: 1)

Try this:

    cv = CountVectorizer(max_features=200,
                         analyzer='word',
                         stop_words=stop_words_corpus.stack().unique())

Answer 1 (score: 0)

We need to convert the dataframe column to a NumPy array to pass the stop words into the vectorizer:

    stop_word = stop_words_corpus['Word'].values

    cv = CountVectorizer(max_features=200,
                         analyzer='word',
                         stop_words=stop_word)