Pickling a model together with its vectorizer

Time: 2018-04-15 05:44:06

Tags: python machine-learning scikit-learn

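I am pickling a model for later use. I then load the model and run predict_proba on it, and I get ValueError: X has 1 features per sample; expecting 319. I am not sure whether I am transforming the input correctly.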

import csv, pickle
from sklearn import svm

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.calibration import CalibratedClassifierCV
import numpy as np
import operator

train_data = []
train_labels = []
test_lables = []
test_lables.append("nah")

with open('training_file', 'r') as f:
    reader = csv.reader(f, dialect='excel', delimiter='\t')
    for row in reader:
        train_data.append(row[0])
        train_labels.append(row[1])

lables = []

for item in train_labels:
    if item in lables:
        continue
    else:
        lables.append(item)


def linear_svc(train_data, train_labels):

    vectorizer = TfidfVectorizer()
    train_vectors = vectorizer.fit_transform(train_data)
    classifier_linear = svm.LinearSVC()
    clf = CalibratedClassifierCV(classifier_linear)   
    clf.fit(train_vectors, train_labels)

    with open('test', 'wb') as fi:
        pickle.dump(clf, fi)


def run_classifier():    
    vectorizer = TfidfVectorizer()
    test_vectors = vectorizer.fit_transform(test_lables)
    with open('test', 'rb') as fi:
        clf = pickle.load(fi)
    prediction_linear = clf.predict_proba(test_vectors) 
    return prediction_linear


#linear_svc(train_data, train_labels)
sorted_intent_probability = run_classifier()
print(sorted_intent_probability)

I call linear_svc() first, and the model gets pickled. Then I call run_classifier(). What am I doing wrong here? Also, when I combine the two methods, it works fine:

def linear_svc(train_data, train_labels, test_lables):

    vectorizer = TfidfVectorizer()
    train_vectors = vectorizer.fit_transform(train_data)
    test_vectors = vectorizer.transform(test_lables)
    classifier_linear = svm.LinearSVC()
    clf = CalibratedClassifierCV(classifier_linear) 

    clf.fit(train_vectors, train_labels)
    prediction_linear = clf.predict_proba(test_vectors)
    return prediction_linear

Do I need to pickle the vectorizer and reuse it later?

1 Answer:

Answer 0: (score: 1)

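I found the problem. When I created a new instance of TfidfVectorizer() in run_classifier(), I was not using the same features as in training: the new vectorizer was fitted on the single test string, so it produced 1 feature instead of the 319 the classifier was trained on. I made the following change: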

linear_svc_model = clf.fit(train_vectors, train_labels)
model_object = []
model_object.append(linear_svc_model)
model_object.append(vectorizer)

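I then pickled this model_object. At prediction time I unpickle both the classifier and the vectorizer, and use that same fitted vectorizer on the input strings. It works.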
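For reference, here is a minimal end-to-end sketch of that fix. It is an illustration under assumptions, not the exact code from the question: the toy training sentences and labels, the function names train_and_save and load_and_predict, the temporary file path, and cv=3 (chosen only so that the tiny dataset has enough samples per class) are all made up for the example. The key point is that prediction calls transform() on the unpickled vectorizer rather than fit_transform() on a new one.

```python
import os
import pickle
import tempfile

from sklearn import svm
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for the rows of 'training_file'.
train_data = [
    "book a flight to paris", "reserve a plane ticket", "find me a flight",
    "play some jazz music", "put on a rock song", "start my music playlist",
]
train_labels = ["travel", "travel", "travel", "music", "music", "music"]


def train_and_save(path):
    """Fit the vectorizer and classifier, then pickle both together."""
    vectorizer = TfidfVectorizer()
    train_vectors = vectorizer.fit_transform(train_data)
    # cv=3 is an assumption for this tiny dataset (3 samples per class).
    clf = CalibratedClassifierCV(svm.LinearSVC(), cv=3)
    clf.fit(train_vectors, train_labels)
    with open(path, "wb") as fi:
        pickle.dump([clf, vectorizer], fi)


def load_and_predict(path, texts):
    """Unpickle both objects and reuse the fitted vocabulary."""
    with open(path, "rb") as fi:
        clf, vectorizer = pickle.load(fi)
    # transform(), not fit_transform(): keep the training feature space.
    test_vectors = vectorizer.transform(texts)
    return clf.classes_, clf.predict_proba(test_vectors)


model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")
train_and_save(model_path)
classes, probabilities = load_and_predict(model_path, ["nah"])
print(classes, probabilities)
```

Because transform() maps any input onto the vocabulary learned during training, the test matrix always has the same number of columns the classifier expects, even for a string such as "nah" that contains no known words. Another way to keep the two objects together is to wrap them in a scikit-learn Pipeline and pickle that single object.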