I'd like to create a CountVectorizer in scikit-learn from a corpus of text and then add more text to it later, extending the original vocabulary.
If I use transform(), it keeps the original vocabulary but adds no new words. If I use fit_transform(), it regenerates the vocabulary from scratch. See below:
In [1]: from sklearn.feature_extraction.text import CountVectorizer
In [2]: count_vect = CountVectorizer()
In [3]: count_vect.fit_transform(["This is a test"])
Out[3]:
<1x3 sparse matrix of type '<type 'numpy.int64'>'
with 3 stored elements in Compressed Sparse Row format>
In [4]: count_vect.vocabulary_
Out[4]: {u'is': 0, u'test': 1, u'this': 2}
In [5]: count_vect.transform(["This not is a test"])
Out[5]:
<1x3 sparse matrix of type '<type 'numpy.int64'>'
with 3 stored elements in Compressed Sparse Row format>
In [6]: count_vect.vocabulary_
Out[6]: {u'is': 0, u'test': 1, u'this': 2}
In [7]: count_vect.fit_transform(["This not is a test"])
Out[7]:
<1x4 sparse matrix of type '<type 'numpy.int64'>'
with 4 stored elements in Compressed Sparse Row format>
In [8]: count_vect.vocabulary_
Out[8]: {u'is': 0, u'not': 1, u'test': 2, u'this': 3}
I'd like the equivalent of an update() function that keeps the existing vocabulary and adds any new words from the new text.
Is there a way to do this?
Answer 0 (score: 4)
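The algorithms implemented in scikit-learn are designed to be fit on all of the data at once, which is necessary for most ML algorithms (though, interestingly, not for the application you describe), so there is no update functionality.
There is a way to get what you want by approaching the problem slightly differently: refit on the original text together with the new text. The following code prints the vocabulary after each fit: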
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()

# Fit on the original corpus.
count_vect.fit_transform(["This is a test"])
print(count_vect.vocabulary_)

# Refit on the original corpus plus the new text; the vocabulary now includes "not".
count_vect.fit_transform(["This is a test", "This is not a test"])
print(count_vect.vocabulary_)
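
If keeping the whole corpus around for refitting is not practical, one possible alternative is to extend the learned vocabulary by hand and build a new vectorizer around it. This is only a sketch, not something scikit-learn provides: the helper name extend_vocabulary is hypothetical. It relies on the existing build_analyzer() and get_params() methods and on the vocabulary constructor parameter, and it assumes the original vectorizer has already been fitted.

```python
from sklearn.feature_extraction.text import CountVectorizer


def extend_vocabulary(vectorizer, new_docs):
    """Hypothetical helper: return a new CountVectorizer whose fixed
    vocabulary is the fitted vocabulary of `vectorizer` plus any unseen
    tokens found in `new_docs`. Existing column indices are preserved."""
    # Copy the fitted vocabulary so the original vectorizer is untouched.
    merged = {term: int(idx) for term, idx in vectorizer.vocabulary_.items()}

    # Tokenize with the same preprocessing the vectorizer itself uses.
    analyze = vectorizer.build_analyzer()
    for doc in new_docs:
        for token in analyze(doc):
            if token not in merged:
                # Indices are contiguous, so the next free one is len(merged).
                merged[token] = len(merged)

    # Reuse the original settings, but with the merged vocabulary fixed.
    params = vectorizer.get_params()
    params["vocabulary"] = merged
    return CountVectorizer(**params)


count_vect = CountVectorizer()
count_vect.fit(["This is a test"])

# "not" is the only unseen token, so it is appended as a new column.
updated = extend_vocabulary(count_vect, ["This is not a test"])
counts = updated.transform(["This not is a test"])
print(updated.vocabulary_)
print(counts.shape)
```

Because new terms are appended after the existing ones, the columns of the old vocabulary keep their positions. Matrices produced before the extension still have the old, smaller number of columns, so they would need padding before being combined with new ones.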