I am trying to make predictions with a trained BoW, tf-idf and SVM model:
def bagOfWords(files_data):
    count_vector = sklearn.feature_extraction.text.CountVectorizer()
    return count_vector.fit_transform(files_data)
files = sklearn.datasets.load_files(dir_path)
word_counts = util.bagOfWords(files.data)
tf_transformer = sklearn.feature_extraction.text.TfidfTransformer(use_idf=True).fit(word_counts)
X = tf_transformer.transform(word_counts)
clf = sklearn.svm.LinearSVC()
X_train, X_test, y_train, y_test = sklearn.cross_validation.train_test_split(X, y, test_size=test_size)
I can run the following:
clf.fit(X_train, y_train)
y_predicted = clf.predict(X_test)
But the following raises an error:
clf.fit(X_train, y_train)
new_word_counts = util.bagOfWords(["a place to listen to music it s making its way to the us"])
ready_to_be_predicted = tf_transformer.transform(new_word_counts)
predicted = clf.predict(ready_to_be_predicted)
I thought I was already reusing the previously fitted tf_transformer, so I don't know why it still fails. Any help is much appreciated!
Answer 0 (score: 2)
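You did not keep the CountVectorizer that was originally fitted to the data.
This bagOfWords call fits a separate CountVectorizer in its own scope, so it learns a new vocabulary from the single input string: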
new_word_counts = util.bagOfWords(["a place to listen to music it s making its way to the us"])
You want to use the one that was fitted on your training set.
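To make the mismatch concrete, here is a minimal sketch. The toy documents are hypothetical stand-ins for files.data, not the asker's data. It compares the number of columns produced by a freshly fitted vectorizer with the number produced by reusing the fitted one.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for files.data
train_docs = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "music is playing in the park",
]
new_doc = ["a place to listen to music"]

# Vectorizer fitted on the training documents
fitted = CountVectorizer()
train_counts = fitted.fit_transform(train_docs)

# A fresh vectorizer learns its vocabulary from the new text only
refit_counts = CountVectorizer().fit_transform(new_doc)

# Reusing the fitted vectorizer keeps the training vocabulary
reused_counts = fitted.transform(new_doc)

print("train columns: ", train_counts.shape[1])
print("refit columns: ", refit_counts.shape[1])
print("reused columns:", reused_counts.shape[1])
```

The refit matrix has a different number of columns from the training matrix, which is why the downstream transformer and classifier reject it. The reused matrix always has the same number of columns as the training matrix.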
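You are also fitting your transformer on the whole of X, including X_test. You want to keep the test set out of any fitting, including the transforms.
Try something like this.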
import sklearn.datasets
import sklearn.feature_extraction.text
import sklearn.model_selection  # replaces sklearn.cross_validation, removed in scikit-learn 0.20
import sklearn.svm
files = sklearn.datasets.load_files(dir_path)
# Split into train/test before fitting anything
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(files.data, files.target)
# Fit and transform with X_train only
count_vector = sklearn.feature_extraction.text.CountVectorizer()
word_counts = count_vector.fit_transform(X_train)
tf_transformer = sklearn.feature_extraction.text.TfidfTransformer(use_idf=True)
X_train = tf_transformer.fit_transform(word_counts)
clf = sklearn.svm.LinearSVC()
clf.fit(X_train, y_train)
# Transform X_test with the already fitted vectorizer and transformer (no refitting)
test_word_counts = count_vector.transform(X_test)
ready_to_be_predicted = tf_transformer.transform(test_word_counts)
y_predicted = clf.predict(ready_to_be_predicted)
# Test example
new_word_counts = count_vector.transform(["a place to listen to music it s making its way to the us"])
ready_to_be_predicted = tf_transformer.transform(new_word_counts)
predicted = clf.predict(ready_to_be_predicted)
Of course, it is much simpler to combine these transformers into a pipeline: http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
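A rough sketch of what the pipeline version might look like. The documents, labels, step names, and the test_size / random_state values are made up for illustration; in the question they would come from files.data and files.target.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for files.data / files.target
docs = [
    "listen to music on the radio",
    "a new album of music is out",
    "the band is playing music tonight",
    "a place to stream music and songs",
    "the team won the football match",
    "a late goal decided the match",
    "the football season starts on sunday",
    "the coach praised the team after the game",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.25, random_state=0, stratify=labels
)

# The pipeline fits every step on the training text only and
# reuses the fitted steps whenever predict() is called.
text_clf = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer(use_idf=True)),
    ("svm", LinearSVC()),
])
text_clf.fit(X_train, y_train)

# Raw strings go straight in; no manual transform calls are needed
y_predicted = text_clf.predict(X_test)
predicted = text_clf.predict(
    ["a place to listen to music it s making its way to the us"]
)
print(y_predicted, predicted)
```

Because the fitted vectorizer and transformer live inside the pipeline object, there is no way to accidentally refit them on new text at prediction time.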