Sklearn classifier and Flask problem

Date: 2017-07-28 04:16:40

Tags: python apache web-applications flask scikit-learn

I have been trying to self-host a sklearn classifier with Apache. I ended up serializing the saved model with joblib and then loading it in a Flask application. The app works perfectly well under Flask's built-in development server, but when I set it up on a Debian 9 Apache server I get a 500 error. Digging into Apache's error.log, I find:

AttributeError: module '__main__' has no attribute 'tokenize'

Now, this is interesting to me because, although I wrote my own tokenizer, the web application gave me no trouble when I ran it locally. Furthermore, the saved model I am using was trained on the web server, so slightly different library versions should not be the problem.

My web application code is:

import re
import sys

from flask import Flask, request, render_template
from nltk import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.externals import joblib

app = Flask(__name__)



def tokenize(text):
    # text = text.translate(str.maketrans('','',string.punctuation))
    text = re.sub(r'\W+', ' ', text)
    tokens = word_tokenize(text)
    lemas = []
    for item in tokens:
        lemas.append(WordNetLemmatizer().lemmatize(item))
    return lemas

@app.route('/')
def home():
    return render_template('home.html')

@app.route('/analyze',methods=['POST','GET'])
def analyze():
    if request.method=='POST':
        result=request.form
        input_text = result['input_text']

        clf = joblib.load("model.pkl.z")
        parameters = clf.named_steps['clf'].get_params()
        predicted = clf.predict([input_text])
        # print(predicted)
        certainty = clf.decision_function([input_text])

        # Is it bonkers?
        if predicted[0]:
            verdict = "Not too nuts!"
        else:
            verdict = "Bonkers!"

        return render_template('result.html',prediction=[input_text, verdict, float(certainty), parameters])

if __name__ == '__main__':
    #app.debug = True
    app.run()

The .wsgi file is:

import sys 
sys.path.append('/var/www/mysite')

from conspiracydetector import app as application
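
For reference, the site is served through mod_wsgi; the virtual host is wired up along these lines (the .wsgi filename and ServerName here are placeholders, not my exact config):

<VirtualHost *:80>
    ServerName example.com

    # Run the Flask app in a dedicated daemon process and hand all
    # requests to the .wsgi file shown above.
    WSGIDaemonProcess conspiracydetector python-path=/var/www/mysite
    WSGIProcessGroup conspiracydetector
    WSGIScriptAlias / /var/www/mysite/conspiracydetector.wsgi

    <Directory /var/www/mysite>
        Require all granted
    </Directory>
</VirtualHost>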

Additionally, I trained the model with the following code:

import logging
import pprint  # Pretty stuff
import re
import sys  # For command line arguments
from time import time  # to show progress

import numpy as np
from nltk import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn import metrics
from sklearn.datasets import load_files
from sklearn.externals import joblib  # In order to save
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Tokenizer that does stemming and strips punctuation
def tokenize(text):
    # text = text.translate(str.maketrans('','',string.punctuation))
    text = re.sub(r'\W+', ' ', text)
    tokens = word_tokenize(text)
    lemas = []
    for item in tokens:
        lemas.append(WordNetLemmatizer().lemmatize(item))
    return lemas

if __name__ == "__main__":
    # NOTE: we put the following in a 'if __name__ == "__main__"' protected
    # block to be able to use a multi-core grid search that also works under
    # Windows, see: http://docs.python.org/library/multiprocessing.html#windows
    # The multiprocessing module is used as the backend of joblib.Parallel
    # that is used when n_jobs != 1 in GridSearchCV

    # Display progress logs on stdout
    print("Initializing...")
    # Command line arguments
    save = sys.argv[1]
    training_directory = sys.argv[2]

    logging.basicConfig(level=logging.INFO,
                        format='%(asctime)s %(levelname)s %(message)s')

    dataset = load_files(training_directory, shuffle=False)
    print("n_samples: %d" % len(dataset.data))

    # split the dataset in training and test set:
    print("Splitting the dataset in training and test set...")
    docs_train, docs_test, y_train, y_test = train_test_split(
        dataset.data, dataset.target, test_size=0.25, random_state=None)

    # Build a vectorizer / classifier pipeline that filters out tokens
    # that are too rare or too frequent
    # Also remove stop words
    print("Loading list of stop words...")
    with open('stopwords.txt', 'r') as f:
        words = [line.strip() for line in f]

    print("Stop words list loaded...")
    print("Setting up pipeline...")
    pipeline = Pipeline(
        [
            # ('vect', TfidfVectorizer(stop_words=words, min_df=0.001, max_df=0.5, ngram_range=(1,1))),
            ('vect',
             TfidfVectorizer(tokenizer=tokenize, stop_words=words, min_df=0.001, max_df=0.5, ngram_range=(1, 1))),
            ('clf', LinearSVC(C=5000)),
        ])

    print("Pipeline:", [name for name, _ in pipeline.steps])

    # Build a grid search to find out whether unigrams or bigrams are
    # more useful.
    # Fit the pipeline on the training set using grid search for the parameters
    print("Initializing grid search...")

    # uncommenting more parameters will give better exploring power but will
    # increase processing time in a combinatorial way
    parameters = {
        # 'vect__ngram_range': [(1, 1), (1, 2)],
        # 'vect__min_df': (0.0005, 0.001),
        # 'vect__max_df': (0.25, 0.5),
        # 'clf__C': (10, 15, 20),
    }
    print("Parameters:")
    pprint.pprint(parameters)
    grid_search = GridSearchCV(
        pipeline,
        parameters,
        n_jobs=-1,
        verbose=True)

    print("Training and performing grid search...\n")
    t0 = time()
    grid_search.fit(docs_train, y_train)
    print("\nDone in %0.3fs!\n" % (time() - t0))

    # Print the mean and std for each candidate along with the parameter
    # settings for all the candidates explored by grid search.
    n_candidates = len(grid_search.cv_results_['params'])
    for i in range(n_candidates):
        print(i, 'params - %s; mean - %0.2f; std - %0.2f'
              % (grid_search.cv_results_['params'][i],
                 grid_search.cv_results_['mean_test_score'][i],
                 grid_search.cv_results_['std_test_score'][i]))

    # Predict the outcome on the testing set and store it in a variable
    # named y_predicted
    print("\nRunning against testing set...\n")
    y_predicted = grid_search.predict(docs_test)

    # Save model
    print("\nSaving model to", save, "...")
    joblib.dump(grid_search.best_estimator_, save)
    print("Model Saved! \nPrepare for some awesome stats!")

I must admit that I am stumped. After tinkering, searching, and making sure my server is configured correctly, I feel that perhaps someone here can help. Any help is appreciated; if I need to provide more information, please let me know and I will be happy to do so.

Also, I am running:

  • python 3.5.3 with nltk and sklearn.

1 Answer:

Answer 0 (score: 1):

I solved this problem, although imperfectly, by removing my custom tokenizer and falling back to one of sklearn's.
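
Concretely, the change amounts to dropping the tokenizer argument from the vectorizer in the training pipeline and letting TfidfVectorizer use its built-in tokenization, roughly like this (a sketch of the relevant step only, not my full training script):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Stop word list loaded as in the question
with open('stopwords.txt', 'r') as f:
    words = [line.strip() for line in f]

# No custom tokenizer: TfidfVectorizer falls back to its default
# token_pattern-based tokenization, so the pickled pipeline no longer
# references a function defined in __main__.
pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words=words, min_df=0.001,
                             max_df=0.5, ngram_range=(1, 1))),
    ('clf', LinearSVC(C=5000)),
])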

However, I still do not know how to integrate my own tokenizer.