X has 7 features per sample; expecting 18282

Time: 2019-09-22 13:08:58

Tags: python-3.x scikit-learn text-classification

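I am trying to build a text classification model with sklearn. I am new to both Python and sklearn. I trained a model on some training data and saved it, but when I try to reuse that model in another Python program/file, I get an error.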

I have looked at some similar questions on Stack Overflow, but could not find a solution that works for me.

I added some comments to make the code easier to read.

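Since I train with several different methods to evaluate which one works best, I wrote a train_model method (shown further below). First, this is how I load and vectorize the data: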

...
# load the dataset
data = codecs.open('C:/Users/baran/PycharmProjects/test/resource/CorpusMitLabelsPlusSonstige.txt', encoding='utf8',
                   errors='ignore').read()

# separate labels from text
labels, texts = [], []
for i, line in enumerate(data.split("\n")):
    content = line.split()
    labels.append(content[0])
    texts.append(" ".join(content[1:]))

# create a dataframe using texts and labels
trainDF = pandas.DataFrame()
trainDF['text'] = texts
trainDF['label'] = labels

# split the dataset into training and validation datasets
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(trainDF['text'], trainDF['label'])

# label encode the target variable
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.fit_transform(valid_y)

# create a count vectorizer object
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(trainDF['text'])

# transform the training and validation data using count vectorizer object
xtrain_count = count_vect.transform(train_x)
xvalid_count = count_vect.transform(valid_x)
...
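One detail in the snippet above may be worth checking separately from the main error: `encoder.fit_transform(valid_y)` refits the encoder on the validation labels, so the label-to-integer mapping can differ from the one learned on the training labels if a class is missing from the validation split. The following is a minimal sketch of that effect, not a confirmed diagnosis; the label strings are made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical label sets; "bus" does not occur in the validation split
train_y = ["bus", "tram", "other"]
valid_y = ["tram", "other"]

# Fit once on the training labels, then only transform the validation labels
encoder = LabelEncoder()
encoder.fit(train_y)
consistent = encoder.transform(valid_y)

# Refitting on the validation labels builds a different mapping
refit = LabelEncoder().fit_transform(valid_y)

print(list(consistent))  # codes under the training mapping
print(list(refit))       # codes under the refitted mapping
```

With a consistent mapping, an encoded prediction such as "3" refers to the same label in every file that reuses the saved encoder.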

This is train_model, including the "correct_model" branch that saves the fitted classifier:

...
def train_model(classifier, feature_vector_train, label, feature_vector_valid, is_neural_net=False, is_not_tfid=False,
            correct_model=False):
    # fit the training dataset on the classifier
    ...
    elif correct_model:
        classifier.fit(feature_vector_train, label)
        pkl_filename = "C:/Users/baran/PycharmProjects/test/resources/pickle_model.pkl"
        with open(pkl_filename, 'wb') as file:
            pickle.dump(classifier, file)
        # with open(pkl_filename, 'rb') as file:
        #     pickle_model = pickle.load(file)
        # joblib.dump(classifier, "C:/Users/baran/PycharmProjects/test/resources/model.pkl")
        # loaded_model = joblib.load("C:/Users/baran/PycharmProjects/test/resources/model.pkl")
        # result = loaded_model.score(feat)
        # print(pickle_model.predict(feature_vector_valid))
    ...
    # predict the labels on validation dataset
    predictions = classifier.predict(feature_vector_valid)
    ...
    return metrics.accuracy_score(valid_y, predictions)
...

This model gives me roughly 80% accuracy on the validation data.

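I train and save the model with the call below. I then want to check, in a separate test file, whether I can load and reuse it: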

...
# Linear Classifier on Count Vectors
accuracy = train_model(linear_model.LogisticRegression(), xtrain_count, train_y, xvalid_count, correct_model=True)
print("LR, Count Vectors: ", accuracy)
...

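This is the test file. Running it produces the error in the title, "X has 7 features per sample; expecting 18282":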

...
texts = []
texts.append("Der Bus hat nicht an der Haltestelle gehalten")

# create a dataframe using texts and labels
trainDF = pandas.DataFrame()
trainDF['text'] = texts

# create a count vectorizer object
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(trainDF['text'])

# transform the training and validation data using count vectorizer object
test_data = count_vect.transform(trainDF['text'])

# load the model
pkl_filename = "C:/Users/baran/PycharmProjects/test/resources/pickle_model.pkl"
with open(pkl_filename, 'rb') as file:
    pickle_model = pickle.load(file)

# reuse the model
test_load = joblib.load("C:/Users/baran/PycharmProjects/test/model.pkl")
print(test_load.predict(test_data))
...
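The "7" in the error message can be reproduced from the test file alone. As a small sketch using the same vectorizer settings and the same sentence, fitting a fresh CountVectorizer on that one sentence yields a vocabulary of seven distinct lowercased tokens, which is unrelated to the vocabulary built from the training corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same vectorizer settings and sentence as in the test file above
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(["Der Bus hat nicht an der Haltestelle gehalten"])

# Tokens are lowercased by default, so "Der" and "der" share one column
vocabulary = sorted(count_vect.vocabulary_)
n_features = len(vocabulary)

print(vocabulary)
print(n_features)
```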

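I expected the result to be "3", which is the encoded value of a particular label. The predictions work in the same file where I train the model, but for some reason I cannot use the model with new validation data.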

I think I made a mistake when fitting and/or transforming the data.
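A possible explanation, offered as a sketch rather than a confirmed diagnosis: the test file fits a brand-new CountVectorizer on a single sentence, so its columns do not match the 18282 columns the classifier was trained on. One common way around this is to save the fitted vectorizer together with the classifier, for example in a scikit-learn Pipeline, and to call only predict on raw text after loading. The corpus, labels, and variable names below are made up for illustration, and `pickle.dumps`/`pickle.loads` stand in for writing to and reading from the .pkl file.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the real training file
train_texts = [
    "Der Bus hat nicht an der Haltestelle gehalten",
    "Der Bus kam zu spät",
    "Der Fahrer war sehr freundlich",
    "Die Fahrerin war hilfsbereit",
]
train_labels = [3, 3, 1, 1]  # made-up encoded labels

# Vectorizer and classifier are fitted together and saved together
model = make_pipeline(
    CountVectorizer(analyzer='word', token_pattern=r'\w{1,}'),
    LogisticRegression(),
)
model.fit(train_texts, train_labels)
saved = pickle.dumps(model)  # pickle.dump(model, file) would write to disk

# In the other program: load the pipeline and pass raw text, with no new fit
loaded = pickle.loads(saved)
prediction = loaded.predict(["Der Bus hat nicht an der Haltestelle gehalten"])
print(prediction)

# The loaded vectorizer and classifier agree on the number of features
n_vocab = len(loaded.named_steps['countvectorizer'].vocabulary_)
n_coef = loaded.named_steps['logisticregression'].coef_.shape[1]
print(n_vocab, n_coef)
```

Keeping both steps in one object means the test file never needs to call fit, so the feature count seen at prediction time always matches the one from training.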

0 answers:

No answers yet