How do I use a trained NB classifier in sklearn to predict the label of an email?

Asked: 2016-05-06 09:25:15

Tags: python-3.x machine-learning scikit-learn naivebayes

I have created a Gaussian Naive Bayes classifier on an email (spam/not-spam) dataset and was able to run it successfully. I vectorized the data, divided it into training and test sets, and then calculated the accuracy, all with the features that are present in the sklearn Gaussian Naive Bayes classifier.

Now I want to be able to use this classifier to predict the "label" of a new email: whether it is spam or not. For example, say I have an email. I want to feed it into my classifier and get a prediction of whether it is spam or not. How can I achieve this? Please help.

Code for the classifier file:

#!/usr/bin/python

import sys
from time import time
import logging

# Display progress logs on stdout
logging.basicConfig(level = logging.DEBUG, format = '%(asctime)s %(message)s')

sys.path.append("../DatasetProcessing/")
from vectorize_split_dataset import preprocess

### features_train and features_test are the features for the training
### and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()

#########################################################
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
t0 = time()
clf.fit(features_train, labels_train)
print("training time:", round(time() - t0, 3), "s")
pred = clf.predict(features_test)
print(clf.score(features_test, labels_test))

## Printing Metrics for Training and Testing
print("No. of Testing Features:" + str(len(features_test)))
print("No. of Testing Features Label:" + str(len(labels_test)))
print("No. of Training Features:" + str(len(features_train)))
print("No. of Training Features Label:" + str(len(labels_train)))
print("No. of Predicted Features:" + str(len(pred)))

## Calculating Classifier Performance
from sklearn.metrics import classification_report
y_true = labels_test
y_pred = pred
labels = ['0', '1']
target_names = ['class 0', 'class 1']
print(classification_report(y_true, y_pred, target_names = target_names, labels = labels))

# How to predict the label of a new text?
new_text = "You won a lottery at UK lottery commission. Reply to claim it"

Vectorization code:

#!/usr/bin/python

import os
import pickle
import numpy
numpy.random.seed(42)

path = os.path.dirname(os.path.abspath(__file__))

### The words (features) and label_data (labels), already largely processed.
### These files should have been created beforehand.
feature_data_file = path + "/createdDataset/dataSet.pkl"
label_data_file = path + "/createdDataset/dataLabel.pkl"

feature_data = pickle.load(open(feature_data_file, "rb"))
label_data = pickle.load(open(label_data_file, "rb"))

### test_size is the percentage of events assigned to the test set
### (the remainder go into training)
### feature matrices changed to dense representations for compatibility
### with classifier functions in versions 0.15.2 and earlier
from sklearn import cross_validation
features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(feature_data, label_data, test_size = 0.1, random_state = 42)

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(sublinear_tf = True, max_df = 0.5, stop_words = 'english')
features_train = vectorizer.fit_transform(features_train)
features_test = vectorizer.transform(features_test)#.toarray()

## feature selection to reduce dimensionality
from sklearn.feature_selection import SelectPercentile, f_classif
selector = SelectPercentile(f_classif, percentile = 5)
selector.fit(features_train, labels_train)
features_train_transformed_reduced = selector.transform(features_train).toarray()
features_test_transformed_reduced = selector.transform(features_test).toarray()

features_train = features_train_transformed_reduced
features_test = features_test_transformed_reduced

def preprocess():
  return features_train, features_test, labels_train, labels_test

Dataset generation code:

#!/usr/bin/python

import os
import pickle
import re
import sys

# sys.path.append("../tools/")


""
"
    Starter code to process the texts of accuate and inaccurate category to extract
    the features and get the documents ready for classification.

    The list of all the texts from accurate category are in the accurate_files list
    likewise for texts of inaccurate category are in (inaccurate_files)

    The data is stored in lists and packed away in pickle files at the end.
"
""


accurate_files = open("./rawDatasetLocation/accurateFiles.txt", "r")
inaccurate_files = open("./rawDatasetLocation/inaccurateFiles.txt", "r")

label_data = []
feature_data = []

### temp_counter is a way to speed up development -- there are thousands of
### lines of accurate and inaccurate text, so running over all of them can
### take a long time
### temp_counter helps you only look at the first 200 lines in the list so
### you can iterate on your modifications quicker
temp_counter = 0


for name, from_text in [("accurate", accurate_files), ("inaccurate", inaccurate_files)]:
  for path in from_text:
    ### only look at the first 200 texts when developing; once everything is
    ### working, remove this check to run over the full dataset
    temp_counter += 1
    if temp_counter < 200:
      path = os.path.join('..', path[:-1])
      print(path)
      text = open(path, "r")
      line = text.readline()
      while line:
        ### use a function parseOutText to extract the text from the opened file
        # stem_text = parseOutText(text)
        stem_text = line.strip()
        print(stem_text)
        ### use str.replace() to remove any instances of unwanted words
        # stem_text = stem_text.replace("germani", "")
        ### append the text to feature_data
        feature_data.append(stem_text)
        ### append "0" to label_data if the text is accurate, "1" if inaccurate
        if name == "accurate":
          label_data.append("0")
        elif name == "inaccurate":
          label_data.append("1")
        line = text.readline()
      text.close()

print("texts processed")
accurate_files.close()
inaccurate_files.close()

pickle.dump(feature_data, open("./createdDataset/dataSet.pkl", "wb"))
pickle.dump(label_data, open("./createdDataset/dataLabel.pkl", "wb"))

Additionally, I would like to know whether I can train the classifier incrementally, i.e. retrain the created model with newer data over time so that the model keeps improving.

I would be really glad if someone could help me with this. I am truly stuck.

1 Answer:

Answer 0 (score: 1)

You have already used your model to predict the labels of the emails in your test set; that is what pred = clf.predict(features_test) does. If you want to see those labels, do print(pred).

But perhaps what you are asking is how to predict the labels of emails that you come across in the future and that are not currently in your test set? If so, you can treat each new email as a new test set. As with your previous test set, you will need to run the data through several key processing steps:

1) The first thing you need to do is generate features for the new email data. The feature-generation step is not included in your code above, but it needs to happen.

2) You are using a Tfidf vectorizer, which converts a collection of documents to a matrix of Tfidf features based on term frequency and inverse document frequency. You need to push the new email feature data through the vectorizer that was fitted on your training data.

3) Your new email feature data then needs to go through the same dimensionality reduction using the selector that was fitted on your training data.

4) Finally, run predict on the new test data. Use print(pred) if you want to see the new labels. (A sketch combining all four steps follows below.)
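Here is a minimal sketch of the four steps. It assumes the fitted vectorizer, selector and clf objects from your training run are still in scope; since your preprocess() only returns the split matrices, you would either have it return the fitted vectorizer and selector as well, or pickle them after fitting and reload them here.

# Step 1: a new raw email (feature generation starts from the raw text)
new_text = "You won a lottery at UK lottery commission. Reply to claim it"

# Step 2: Tfidf features via the vectorizer fitted on the training data
# (transform, not fit_transform, so the training vocabulary is reused)
new_features = vectorizer.transform([new_text])

# Step 3: same dimensionality reduction via the fitted selector
new_features_reduced = selector.transform(new_features).toarray()

# Step 4: predict the label ("0" = accurate/ham, "1" = inaccurate/spam)
pred_new = clf.predict(new_features_reduced)
print(pred_new)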

To answer your final question about iteratively re-training your model: yes, you can definitely do that. Just pick a frequency, write a script that expands your dataset with the incoming data, and then re-run all of the steps from there, from preprocessing through Tfidf vectorization and dimensionality reduction to fitting and prediction.
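If full re-runs become too expensive, note that GaussianNB also supports scikit-learn's partial_fit API, which updates an already-fitted model with new batches instead of refitting from scratch. A minimal sketch under two assumptions: the vectorizer and selector stay frozen (partial_fit cannot change the feature space), and the new batch reuses new_features_reduced from the sketch above with a hypothetical hand-assigned label.

import numpy as np
from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()
# The first partial_fit call must declare every class the model will ever see
clf.partial_fit(features_train, labels_train, classes=np.array(["0", "1"]))

# Later batches update the model in place; here the new batch is the single
# transformed email from above with a hypothetical label of "1" (spam)
clf.partial_fit(new_features_reduced, ["1"])

The trade-off is that the Tfidf vocabulary and the selected features are frozen at fit time, so genuinely new vocabulary in later emails is ignored; a periodic full re-run of the pipeline is still worth scheduling.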