Retraining a persisted SVM model with new data in Scikit-Learn (Python 3)

Time: 2017-10-03 21:12:07

Tags: python machine-learning scikit-learn svm

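I am working on a Python machine learning project with Scikit-Learn that classifies emails into issue types based on their content. For example: someone emails me saying "This program won't start", and the machine classifies it as "Crash Issue".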

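I am using an SVM algorithm, and the email contents and their respective category labels are read from 2 CSV files. I have written two programs: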

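  1. The first program trains the machine and exports the trained model with joblib.dump(), so that the second program can use it.
  2. The second program makes predictions by importing the trained model. I would like the second program to be able to update the trained model by refitting the classifier with new data, but I don't know how to accomplish this. The prediction program asks the user to type an email into it, then makes a prediction. It then asks the user whether its prediction was correct. In either case, I want the machine to learn from the result.

Training program: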

    import numpy as np
    import pandas as pd
    from pandas import DataFrame
    import os
    from sklearn import svm
    from sklearn import preprocessing
    from sklearn.feature_extraction.text import TfidfVectorizer
    import joblib #sklearn.externals.joblib was removed in scikit-learn 0.23; the standalone joblib package replaces it
    
    
    ###### Extract and Vectorize the features from each email in the Training Data ######
    features_file = "features.csv" #The CSV file that contains the descriptions of each email. Features will be extracted from this text data
    features_df = pd.read_csv(features_file, encoding='ISO-8859-1') 
    vectorizer = TfidfVectorizer()
    features = vectorizer.fit_transform(features_df['Description'].values.astype('U')) #The sole column in the CSV file is labeled "Description", so we specify that here
    
    
    ###### Encode the class Labels of the Training Data ######
    labels_file = "labels.csv" #The CSV file that contains the classification labels for each email
    labels_df = pd.read_csv(labels_file, encoding='ISO-8859-1')
    lab_enc = preprocessing.LabelEncoder()
    labels = lab_enc.fit_transform(labels_df.iloc[:, 0]) #Pass the single label column as 1-D data, which LabelEncoder expects
    
    
    ###### Create a classifier and fit it to our Training Data ######
    clf = svm.SVC(gamma=0.01, C=100)
    clf.fit(features, labels)
    
    
    ###### Output persistent model files ######
    joblib.dump(clf, 'brain.pkl')
    joblib.dump(vectorizer, 'vectorizer.pkl')
    joblib.dump(lab_enc, 'lab_enc.pkl')
    print("Training completed.")
    

Prediction program:

    import numpy as np
    import os
    from sklearn import svm
    from sklearn import preprocessing
    from sklearn.feature_extraction.text import TfidfVectorizer
    import joblib #sklearn.externals.joblib was removed in scikit-learn 0.23; the standalone joblib package replaces it
    
    
    ###### Load our model from our training program ######
    clf = joblib.load('brain.pkl')
    vectorizer = joblib.load('vectorizer.pkl')
    lab_enc = joblib.load('lab_enc.pkl')
    
    
    ###### Prompt user for input, then make a prediction ######
    print("Type an email's contents here and I will predict its category")
    newData = [input(">> ")]
    newDataFeatures = vectorizer.transform(newData)
    print("I predict the category is: ", lab_enc.inverse_transform(clf.predict(newDataFeatures)))
    
    
    ###### Feedback loop - Tell the machine whether or not it was correct, and have it learn from the response ######
    print("Was my prediction correct? y/n")
    feedback = input(">> ")
    
    #Keep asking until the user answers with exactly "y" or "n"
    while feedback not in ("y", "n"):
        print("Response not understood. Was my prediction correct? y/n")
        feedback = input(">> ")
    
    if feedback == "y":
        print("I was correct. I'll incorporate this new data into my persistent model to aid in future predictions.")
        #refit the classifier using the new features and label
    elif feedback == "n":
        print("I was incorrect. What was the correct category?")
        correctAnswer = input(">> ")
        print("Got it. I'll incorporate this new data into my persistent model to aid in future predictions.")
        #refit the classifier using the new features and label
    

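From the reading I have done, I gather that SVMs do not really support incremental learning, so I think I need to merge the new data into the old training data and retrain the entire model from scratch every time I add new data. That is fine, but I am not quite sure how to actually implement it. Do I need the prediction program to update the two CSV files to include the new data so that training can start over?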

1 Answer:

Answer 0 (score: 0)

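I eventually figured out that the conceptual answer to my question was that I needed to update the CSV files I originally used to train the machine. After receiving feedback, I simply write the new features and label out to their respective CSV files and then retrain the machine with the new information included in the training data set.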
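
A minimal sketch of the CSV-update step described above, not necessarily the exact code that was used. The helper names `append_row` and `record_feedback` are hypothetical, introduced only for this illustration. It assumes each CSV file holds a single column, already ends with a line break, and is read with the ISO-8859-1 encoding used in the training program.

```python
import csv


def append_row(path, value, encoding="ISO-8859-1"):
    """Append one single-column row to an existing CSV file (hypothetical helper)."""
    # newline="" lets the csv module handle line endings and quoting, so commas
    # or line breaks inside the email text stay within one field.
    # errors="replace" writes "?" for characters the encoding cannot represent.
    with open(path, "a", newline="", encoding=encoding, errors="replace") as f:
        csv.writer(f).writerow([value])


def record_feedback(email_text, label,
                    features_file="features.csv", labels_file="labels.csv"):
    """Store one email and its confirmed label for the next retraining run."""
    # Both files must grow together: row i of features.csv has to keep
    # matching row i of labels.csv.
    append_row(features_file, email_text)
    append_row(labels_file, label)
```

In the prediction program, the label to store would presumably be `lab_enc.inverse_transform(clf.predict(newDataFeatures))[0]` when the user answers "y" and `correctAnswer` when the user answers "n", followed by a call such as `record_feedback(newData[0], label)`. Running the training program again afterwards rebuilds brain.pkl, vectorizer.pkl and lab_enc.pkl from the enlarged CSV files.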