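I am trying to create a text classification model with sklearn. I am new to both Python and sklearn. I have built a model from some training data and saved it, but when I try to reuse that model in another Python program/file, I get an error.
I have already looked at some similar questions on Stack Overflow, but could not find a solution that works for me.
I have added some comments so that you can read the code more easily.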
Since I am training with different methods to evaluate which one works better, I wrote a train_model method.
...
# load the dataset
data = codecs.open('C:/Users/baran/PycharmProjects/test/resource/CorpusMitLabelsPlusSonstige.txt', encoding='utf8',
                   errors='ignore').read()
# separate labels from text
labels, texts = [], []
for i, line in enumerate(data.split("\n")):
    content = line.split()
    labels.append(content[0])
    texts.append(" ".join(content[1:]))
# create a dataframe using texts and labels
trainDF = pandas.DataFrame()
trainDF['text'] = texts
trainDF['label'] = labels
# split the dataset into training and validation datasets
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(trainDF['text'], trainDF['label'])
# label encode the target variable
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
# reuse the mapping fitted on train_y so both splits share the same codes
valid_y = encoder.transform(valid_y)
# create a count vectorizer object
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(trainDF['text'])
# transform the training and validation data using count vectorizer object
xtrain_count = count_vect.transform(train_x)
xvalid_count = count_vect.transform(valid_x)
...
This is the "correct_model":
...
def train_model(classifier, feature_vector_train, label, feature_vector_valid, is_neural_net=False, is_not_tfid=False,
                correct_model=False):
    # fit the training dataset on the classifier
    ...
    elif correct_model:
        classifier.fit(feature_vector_train, label)
        pkl_filename = "C:/Users/baran/PycharmProjects/test/resources/pickle_model.pkl"
        with open(pkl_filename, 'wb') as file:
            pickle.dump(classifier, file)
        # with open(pkl_filename, 'rb') as file:
        #     pickle_model = pickle.load(file)
        # joblib.dump(classifier, "C:/Users/baran/PycharmProjects/test/resources/model.pkl")
        # loaded_model = joblib.load("C:/Users/baran/PycharmProjects/test/resources/model.pkl")
        # result = loaded_model.score(feat)
        # print(pickle_model.predict(feature_vector_valid))
    ...
    # predict the labels on validation dataset
    predictions = classifier.predict(feature_vector_valid)
    ...
    return metrics.accuracy_score(valid_y, predictions)
...
This model gives me about 80% accuracy on the validation data.
So this is my test file, which I use to check whether I can load and reuse the model:
...
# Linear Classifier on Count Vectors
accuracy = train_model(linear_model.LogisticRegression(), xtrain_count, train_y, xvalid_count, correct_model=True)
print("LR, Count Vectors: ", accuracy)
...
Then I get this error:
...
texts = []
texts.append("Der Bus hat nicht an der Haltestelle gehalten")
# create a dataframe using texts and labels
trainDF = pandas.DataFrame()
trainDF['text'] = texts
# create a count vectorizer object
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(trainDF['text'])
# transform the training and validation data using count vectorizer object
test_data = count_vect.transform(trainDF['text'])
# load the model
pkl_filename = "C:/Users/baran/PycharmProjects/test/resources/pickle_model.pkl"
with open(pkl_filename, 'rb') as file:
    pickle_model = pickle.load(file)
# reuse the model
test_load = joblib.load("C:/Users/baran/PycharmProjects/test/model.pkl")
print(test_load.predict(test_data))
...
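What I expect is that the result will be "3", which is the encoding of a specific label. These predictions also work in the same file where I train the model, but for some reason I cannot use the model on new validation data.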
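To make the expectation about the encoded label concrete, here is a minimal sketch of how an integer such as 3 relates to a label name with `LabelEncoder`. The label names are hypothetical stand-ins, since the real corpus labels are not shown in this post. The point is only that the integer code depends on which labels the encoder was fitted on, so an encoder fitted on a different set of labels may assign different codes.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical label names; the real labels from the corpus are not shown in the post.
train_labels = ["Verspaetung", "Fahrer", "Sonstige", "Haltestelle", "Fahrer"]

encoder = LabelEncoder()
train_y = encoder.fit_transform(train_labels)

# classes_ is sorted alphabetically; the position of a class is its integer code.
code_to_label = dict(enumerate(encoder.classes_))
print(code_to_label)

# Map an integer prediction back to its label name.
decoded = encoder.inverse_transform([3])
print(decoded[0])
```

Saving the fitted encoder next to the model would presumably allow the other program to translate a predicted integer back into the original label.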
I think I made a mistake somewhere when fitting and/or transforming the data.
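To illustrate that suspicion, here is a minimal sketch under stated assumptions, not a confirmed fix. It uses a made-up toy corpus, a temporary file path, and hypothetical step names (`count_vect`, `clf`). The idea is that a `CountVectorizer` fitted on a single new sentence learns a different vocabulary than the one the classifier was trained on, so one option is to bundle the vectorizer and the classifier in a `Pipeline` and pickle them together.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; the real training data is not shown in the post.
train_texts = [
    "Der Bus hat nicht an der Haltestelle gehalten",
    "Der Bus kam zu spät",
    "Der Fahrer war sehr freundlich",
    "Die Fahrerin war freundlich und hilfsbereit",
]
train_labels = [0, 0, 1, 1]

# Bundle vectorizer and classifier so they are fitted and saved as one object.
pipeline = Pipeline([
    ("count_vect", CountVectorizer(analyzer="word", token_pattern=r"\w{1,}")),
    ("clf", LogisticRegression()),
])
pipeline.fit(train_texts, train_labels)

# Hypothetical location: a file inside a fresh temporary directory.
model_path = os.path.join(tempfile.mkdtemp(), "pipeline_model.pkl")
with open(model_path, "wb") as f:
    pickle.dump(pipeline, f)

# In the other program: load the pipeline and predict on raw text,
# without fitting a new vectorizer.
with open(model_path, "rb") as f:
    loaded = pickle.load(f)

new_texts = ["Der Bus hat nicht an der Haltestelle gehalten"]
predictions = loaded.predict(new_texts)

# For contrast: a vectorizer refitted on the single new sentence
# produces a different number of features than the trained one.
n_features_train = len(pipeline.named_steps["count_vect"].vocabulary_)
refit_vect = CountVectorizer(analyzer="word", token_pattern=r"\w{1,}")
n_features_refit = refit_vect.fit_transform(new_texts).shape[1]
print(predictions, n_features_train, n_features_refit)
```

If the two feature counts differ, as they do in this toy example, the loaded classifier cannot be applied to the output of the refitted vectorizer, which would be consistent with a failure at prediction time in the second file.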