How to train TfidfVectorizer for a new dataset

Time: 2016-11-07 09:37:06

Tags: text-classification

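I am using TfidfVectorizer and LinearSVC for document classification. As new datasets arrive, I have to train the TfidfVectorizer again and again. Is there a way to store the current TfidfVectorizer and merge in new features when a new dataset arrives?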

Code:

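import os
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# train_data, train_target and test_data are loaded elsewhere (not shown).

# Load the stored vectorizer if one exists; otherwise create and store a new one.
# Note: the vectorizer is pickled here before it has been fitted.
if os.path.exists("trans.pkl"):
    with open("trans.pkl", "rb") as fid:
        transformer = pickle.load(fid)
else:
    transformer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
    with open("trans.pkl", "wb") as fid:
        pickle.dump(transformer, fid)

# fit_transform() relearns the vocabulary from the current training file on every run.
X_train = transformer.fit_transform(train_data)
X_test = transformer.transform(test_data)
print(X_train.shape[1])  # number of features for this training file

# Load the stored classifier if one exists; otherwise train and store a new one.
if os.path.exists("store_model.pkl"):
    print("model exists")
    with open("store_model.pkl", "rb") as fid:
        classifier = pickle.load(fid)
    print(classifier)
else:
    print("model created")
    classifier = LinearSVC().fit(X_train, train_target)
    with open("store_model.pkl", "wb") as fid:
        pickle.dump(classifier, fid)
predictions = classifier.predict(X_test)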
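
A possible cause, offered as a sketch rather than a confirmed fix: in the code above the vectorizer is written to trans.pkl before it is fitted, and fit_transform() is called on every run, so the vocabulary (and therefore the number of columns) is rebuilt from whichever training file is loaded, while the stored classifier still expects the old column layout. The example below fits once, pickles the fitted vectorizer together with the classifier, and uses only transform() after reloading. The toy data, the load_or_fit helper and the model.pkl path are hypothetical and only serve the illustration.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for train_data / train_target / test_data.
train_data = [
    "the cat sat on the mat",
    "dogs chase cats in the yard",
    "stocks fell sharply on monday",
    "the market rallied after earnings",
]
train_target = ["pets", "pets", "finance", "finance"]
test_data = ["my cat chased the dog", "earnings lifted the stocks"]


def load_or_fit(path, data, target):
    """Return a (vectorizer, classifier) pair, fitting and saving it only once."""
    if os.path.exists(path):
        with open(path, "rb") as fid:
            return pickle.load(fid)
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words="english")
    X = vectorizer.fit_transform(data)  # the vocabulary is learned here
    classifier = LinearSVC().fit(X, target)
    with open(path, "wb") as fid:
        pickle.dump((vectorizer, classifier), fid)  # saved AFTER fitting
    return vectorizer, classifier


# Hypothetical file location; a temporary directory keeps the example self-contained.
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
vectorizer, classifier = load_or_fit(path, train_data, train_target)

# Second call finds the pickle and reloads the fitted pair instead of refitting.
reloaded_vectorizer, reloaded_classifier = load_or_fit(path, train_data, train_target)

# transform() (not fit_transform) keeps the column layout the classifier expects.
X_test = reloaded_vectorizer.transform(test_data)
predictions = reloaded_classifier.predict(X_test)
```

This keeps the feature space fixed, so words that appear only in later files are ignored until the pair is refitted on the combined data.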

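I have 2 different training files and 1 test file. I ran the code on the first training file and it worked fine. But when I try the second training file, the number of features differs from the first one, so it throws an error. If I have multiple dataset files like this, how can I train my model?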
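
One way to approach training across several files, sketched here under assumptions and not taken from the original post, is to swap in a different technique: a stateless HashingVectorizer, whose output width is fixed by n_features regardless of which words a file contains, combined with SGDClassifier(loss="hinge"), a linear SVM objective that supports incremental updates through partial_fit. Note that this drops the IDF weighting, because IDF depends on corpus-wide statistics. The two batches, the labels and the choice of n_features below are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Hypothetical batches standing in for the two training files.
batch_1 = (
    ["the cat sat on the mat", "stocks fell sharply on monday"],
    ["pets", "finance"],
)
batch_2 = (
    ["dogs chase cats in the yard", "the market rallied after earnings"],
    ["pets", "finance"],
)
test_data = ["my cat chased the dog", "earnings lifted the stocks"]

# HashingVectorizer is stateless: it needs no fit and always yields n_features columns.
vectorizer = HashingVectorizer(n_features=2**18, stop_words="english")

# loss="hinge" gives a linear SVM objective, trained incrementally by SGD.
classifier = SGDClassifier(loss="hinge", random_state=0)

# partial_fit needs the full set of labels declared on the first call.
all_classes = np.array(["finance", "pets"])

widths = []
for texts, labels in (batch_1, batch_2):
    X = vectorizer.transform(texts)
    widths.append(X.shape[1])  # same width for every batch
    classifier.partial_fit(X, labels, classes=all_classes)

X_test = vectorizer.transform(test_data)
predictions = classifier.predict(X_test)
```

If TF-IDF weighting matters more than incremental updates, the simpler alternative is to refit both the TfidfVectorizer and the LinearSVC on the union of all training files whenever a new file arrives.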

0 answers:

No answers