Can you add to a CountVectorizer in scikit-learn?

Time: 2016-02-12 20:53:10

Tags: python nlp scikit-learn

I'd like to create a CountVectorizer in scikit-learn based on a corpus of text and then add more text to the CountVectorizer later (adding to the original dictionary).

If I use transform(), it does maintain the original vocabulary, but adds no new words. If I use fit_transform(), it just regenerates the vocabulary from scratch. See below:

In [2]: count_vect = CountVectorizer()

In [3]: count_vect.fit_transform(["This is a test"])
Out[3]:
<1x3 sparse matrix of type '<type 'numpy.int64'>'
    with 3 stored elements in Compressed Sparse Row format>

In [4]: count_vect.vocabulary_
Out[4]: {u'is': 0, u'test': 1, u'this': 2}

In [5]: count_vect.transform(["This not is a test"])
Out[5]:
<1x3 sparse matrix of type '<type 'numpy.int64'>'
    with 3 stored elements in Compressed Sparse Row format>

In [6]: count_vect.vocabulary_
Out[6]: {u'is': 0, u'test': 1, u'this': 2}

In [7]: count_vect.fit_transform(["This not is a test"])
Out[7]:
<1x4 sparse matrix of type '<type 'numpy.int64'>'
    with 4 stored elements in Compressed Sparse Row format>

In [8]: count_vect.vocabulary_
Out[8]: {u'is': 0, u'not': 1, u'test': 2, u'this': 3}

I'd like the equivalent of an update() function, one that adds the new words to the existing vocabulary instead of rebuilding it.

Is there a way to do this?

1 answer:

Answer 0 (score: 4)

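The algorithms implemented in scikit-learn are designed to be fit on all of the data at once, which is necessary for most ML algorithms (though, interestingly, not for the application you describe), so there is no update functionality.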

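There is a way to get what you want by thinking about it slightly differently: refit on the original documents together with the new ones. See the following code:

from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()

# Fit on the original corpus.
count_vect.fit_transform(["This is a test"])
print(count_vect.vocabulary_)

# Refit on the original corpus plus the new text.
count_vect.fit_transform(["This is a test", "This is not a test"])
print(count_vect.vocabulary_)

The first print shows a three-word vocabulary (is, test, this). The second shows four words, with not added.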
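
If refitting on the full corpus is not practical, another option is to build a new vectorizer from an explicit vocabulary. The following is only a sketch: extend_vocabulary is a hypothetical helper, not part of scikit-learn. It assumes the original vectorizer has already been fitted, and it relies on the vocabulary parameter of CountVectorizer plus sklearn.base.clone to carry over the tokenization settings. Existing words keep their column index and unseen words are appended at the end.

```python
from sklearn.base import clone
from sklearn.feature_extraction.text import CountVectorizer


def extend_vocabulary(fitted, new_docs):
    """Hypothetical helper (not a scikit-learn API).

    Returns a new vectorizer with the same settings as `fitted` whose
    vocabulary also covers the tokens found in `new_docs`.
    """
    analyze = fitted.build_analyzer()   # same tokenization as the original
    vocab = dict(fitted.vocabulary_)    # copy, so `fitted` is left untouched
    for doc in new_docs:
        for token in analyze(doc):
            if token not in vocab:
                vocab[token] = len(vocab)  # next free column index
    return clone(fitted).set_params(vocabulary=vocab)


count_vect = CountVectorizer()
count_vect.fit(["This is a test"])

extended = extend_vocabulary(count_vect, ["This is not a test"])
X = extended.transform(["This is not a test"])

print(extended.vocabulary_)
print(X.shape)  # (1, 4)
```

Matrices produced before the extension have fewer columns than those produced after it, so they would need to be padded or recomputed before being combined. Options that depend on document frequencies, such as min_df or max_df, are not applied when a fixed vocabulary is supplied.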