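I am using TfidfVectorizer and LinearSVC for document classification. As new datasets arrive, I have to train the TfidfVectorizer again and again. Is there a way to store the current TfidfVectorizer and merge in the new features when a new dataset arrives?
Code: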
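import os
try:
    import cPickle  # Python 2
except ImportError:
    import pickle as cPickle  # Python 3
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# train_data, train_target and test_data are loaded elsewhere.

# Load the stored vectorizer if there is one, otherwise create and store it.
if os.path.exists("trans.pkl"):
    with open("trans.pkl", "rb") as fid:
        transformer = cPickle.load(fid)
else:
    transformer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
    # Note: the vectorizer is pickled here before it has been fitted.
    with open("trans.pkl", "wb") as fid:
        cPickle.dump(transformer, fid)

# fit_transform re-learns the vocabulary from train_data on every run,
# even when a pickled vectorizer was loaded above.
X_train = transformer.fit_transform(train_data)
X_test = transformer.transform(test_data)
print(X_train.shape[1])  # number of features

# Load the stored classifier if there is one, otherwise train and store it.
if os.path.exists("store_model.pkl"):
    print("model exists")
    with open("store_model.pkl", "rb") as fid:
        classifier = cPickle.load(fid)
    print(classifier)
else:
    print("model created")
    classifier = LinearSVC().fit(X_train, train_target)
    with open("store_model.pkl", "wb") as fid:
        cPickle.dump(classifier, fid)

predictions = classifier.predict(X_test)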
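I have 2 different training files and 1 test file. I ran the code with the first training file and it worked fine. But when I try the second training file, the number of features is different from the first, so it raises an error. How can I train my model if I have multiple dataset files like this?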
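
To make the mismatch concrete, here is a minimal pure-Python sketch of what I believe is happening. It is not scikit-learn code: `build_vocabulary`, `vocab_vector` and `hashed_vector` are hypothetical helpers written only for illustration, and the two tiny batches are made up. A vocabulary learned from each batch gives vectors of a different width, which is why a classifier trained on the first width rejects the second. Hashing tokens into a fixed number of columns, the idea behind scikit-learn's `HashingVectorizer`, keeps the width constant, but I have not tested whether that is the right fix for my setup.

```python
import zlib


def build_vocabulary(docs):
    """Map each distinct token in docs to a column index (a stand-in for fitting)."""
    vocab = {}
    for doc in docs:
        for token in doc.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def vocab_vector(doc, vocab):
    """Count vector whose width equals the size of the learned vocabulary."""
    vec = [0] * len(vocab)
    for token in doc.lower().split():
        if token in vocab:  # tokens not seen during fitting are dropped
            vec[vocab[token]] += 1
    return vec


def hashed_vector(doc, n_features=16):
    """Count vector whose width is fixed up front, independent of the data."""
    vec = [0] * n_features
    for token in doc.lower().split():
        # crc32 is deterministic across runs, unlike the built-in hash()
        index = zlib.crc32(token.encode("utf-8")) % n_features
        vec[index] += 1
    return vec


# Made-up batches standing in for the two training files.
batch_1 = ["the cat sat", "the dog barked"]
batch_2 = ["stocks fell sharply today", "markets rallied after the news"]
test_doc = "the cat barked today"

# Refitting on each batch changes the number of columns.
width_1 = len(vocab_vector(test_doc, build_vocabulary(batch_1)))
width_2 = len(vocab_vector(test_doc, build_vocabulary(batch_2)))
print("vocabulary widths: %d vs %d" % (width_1, width_2))

# Hashing gives the same number of columns no matter which batch came first.
print("hashed width: %d" % len(hashed_vector(test_doc)))
```

The trade-off in the hashed version is that different tokens can collide in the same column, and the mapping from column back to token is lost.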