Question

I am following a Stack Overflow post on how to save a classifier. When I try the approach described in the second post there, I keep getting:

ValueError: Vocabulary wasn't fitted or is empty!

My training code is as follows:

train = load_files(learning_data_train)
count_vect = CountVectorizer(tokenizer=tokenize,stop_words='english')
X_train_counts = count_vect.fit_transform(train.data)
clf = SGDClassifier(loss='hinge', penalty='l1',alpha=1e-3, n_iter=5).fit(X_train_counts, train.target)
filename = "SGD.pk1"
joblib.dump(clf, filename)

My test code is as follows:

count_vect = CountVectorizer(tokenizer=tokenize,stop_words='english')
filename = "SGD.pk1"
clf = joblib.load(filename)
print(clf)
file= "testfolder/"
docs_new = []
for i in os.listdir(file):
    docs_new.append(open(file+i,"r").read())
X_new_counts = count_vect.transform(docs_new)
predicted = clf.predict(X_new_counts)
for doc, category in zip(docs_new, predicted):
    print(' => %s' % ( train.target_names[category]))

The error is raised when executing this line:

X_new_counts = count_vect.transform(docs_new)

Am I doing something wrong here?

Answer 1

The CountVectorizer in your test code is a new instance that was never fitted, so it has no vocabulary. Try using fit_transform:

X_new_counts = count_vect.fit_transform(docs_new)

See:

http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer.fit_transform
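
One caveat with refitting at prediction time: fit_transform on the new documents builds a vocabulary from those documents alone, so the resulting columns will generally not line up with the features the classifier was trained on. A minimal sketch of an alternative, assuming a recent scikit-learn and the standalone joblib package, is to save the fitted vectorizer together with the classifier and call only transform when predicting. The toy documents, labels, and file name below are made up for illustration, and the custom tokenizer from the question is left out.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier

# Toy training data (hypothetical; stands in for the load_files(...) output).
train_docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stocks fell sharply today",
    "the market rallied on earnings",
]
train_target = [0, 0, 1, 1]

# Training side: fit the vectorizer and the classifier.
count_vect = CountVectorizer(stop_words="english")
X_train_counts = count_vect.fit_transform(train_docs)
clf = SGDClassifier(loss="hinge", penalty="l1", alpha=1e-3, random_state=0)
clf.fit(X_train_counts, train_target)

# Persist both objects together so the vocabulary travels with the model.
model_path = os.path.join(tempfile.mkdtemp(), "sgd_with_vectorizer.pkl")
joblib.dump({"vectorizer": count_vect, "classifier": clf}, model_path)

# Test side: load the fitted vectorizer and call transform, not fit_transform.
saved = joblib.load(model_path)
loaded_vect = saved["vectorizer"]
loaded_clf = saved["classifier"]

docs_new = ["my cat chased the dog", "earnings lifted the market"]
X_new_counts = loaded_vect.transform(docs_new)
predicted = loaded_clf.predict(X_new_counts)

# The new matrix has the same number of columns as the training matrix.
print(X_new_counts.shape[1] == X_train_counts.shape[1])
print(predicted)
```

If the vectorizer uses a custom tokenizer function, as in the question, that function has to be importable in the script that loads the file, since pickling stores only a reference to it.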

