I have been trying to self-host a sklearn classifier with Apache. I ended up using joblib to serialize the saved model and then load it into a Flask app. Now, this app works perfectly fine when running Flask's built-in development server, but when I set it up on a Debian 9 Apache server, I get a 500 error. Digging into Apache's error.log, I get:
AttributeError: module '__main__' has no attribute 'tokenize'
Now, this is interesting to me because, although I wrote my own tokenizer, the web app gives me no trouble when I run it locally. Furthermore, the saved model I am using was trained on the web server, so slightly different library versions shouldn't be the problem.
My web app code is:
import re
import sys

from flask import Flask, request, render_template
from nltk import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.externals import joblib

app = Flask(__name__)


def tokenize(text):
    # text = text.translate(str.maketrans('','',string.punctuation))
    text = re.sub(r'\W+', ' ', text)
    tokens = word_tokenize(text)
    lemas = []
    for item in tokens:
        lemas.append(WordNetLemmatizer().lemmatize(item))
    return lemas


@app.route('/')
def home():
    return render_template('home.html')


@app.route('/analyze', methods=['POST', 'GET'])
def analyze():
    if request.method == 'POST':
        result = request.form
        input_text = result['input_text']

        clf = joblib.load("model.pkl.z")
        parameters = clf.named_steps['clf'].get_params()
        predicted = clf.predict([input_text])
        # print(predicted)
        certainty = clf.decision_function([input_text])

        # Is it bonkers?
        if predicted[0]:
            verdict = "Not too nuts!"
        else:
            verdict = "Bonkers!"

        return render_template('result.html', prediction=[input_text, verdict, float(certainty), parameters])


if __name__ == '__main__':
    # app.debug = True
    app.run()
My .wsgi file is:
import sys
sys.path.append('/var/www/mysite')
from conspiracydetector import app as application
Additionally, I trained the model with the following code:
import logging
import pprint  # Pretty stuff
import re
import sys  # For command line arguments
from time import time  # to show progress

import numpy as np
from nltk import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn import metrics
from sklearn.datasets import load_files
from sklearn.externals import joblib  # In order to save
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


# Tokenizer that does stemming and strips punctuation
def tokenize(text):
    # text = text.translate(str.maketrans('','',string.punctuation))
    text = re.sub(r'\W+', ' ', text)
    tokens = word_tokenize(text)
    lemas = []
    for item in tokens:
        lemas.append(WordNetLemmatizer().lemmatize(item))
    return lemas


if __name__ == "__main__":
    # NOTE: we put the following in a 'if __name__ == "__main__"' protected
    # block to be able to use a multi-core grid search that also works under
    # Windows, see: http://docs.python.org/library/multiprocessing.html#windows
    # The multiprocessing module is used as the backend of joblib.Parallel
    # that is used when n_jobs != 1 in GridSearchCV

    # Display progress logs on stdout
    print("Initializing...")

    # Command line arguments
    save = sys.argv[1]
    training_directory = sys.argv[2]

    logging.basicConfig(level=logging.INFO,
                        format='%(asctime)s %(levelname)s %(message)s')

    dataset = load_files(training_directory, shuffle=False)
    print("n_samples: %d" % len(dataset.data))

    # split the dataset in training and test set:
    print("Splitting the dataset in training and test set...")
    docs_train, docs_test, y_train, y_test = train_test_split(
        dataset.data, dataset.target, test_size=0.25, random_state=None)

    # Build a vectorizer / classifier pipeline that filters out tokens
    # that are too rare or too frequent
    # Also remove stop words
    print("Loading list of stop words...")
    with open('stopwords.txt', 'r') as f:
        words = [line.strip() for line in f]

    print("Stop words list loaded...")
    print("Setting up pipeline...")
    pipeline = Pipeline(
        [
            # ('vect', TfidfVectorizer(stop_words=words, min_df=0.001, max_df=0.5, ngram_range=(1,1))),
            ('vect',
             TfidfVectorizer(tokenizer=tokenize, stop_words=words, min_df=0.001, max_df=0.5, ngram_range=(1, 1))),
            ('clf', LinearSVC(C=5000)),
        ])
    print("Pipeline:", [name for name, _ in pipeline.steps])

    # Build a grid search to find out whether unigrams or bigrams are
    # more useful.
    # Fit the pipeline on the training set using grid search for the parameters
    print("Initializing grid search...")

    # uncommenting more parameters will give better exploring power but will
    # increase processing time in a combinatorial way
    parameters = {
        # 'vect__ngram_range': [(1, 1), (1, 2)],
        # 'vect__min_df': (0.0005, 0.001),
        # 'vect__max_df': (0.25, 0.5),
        # 'clf__C': (10, 15, 20),
    }
    print("Parameters:")
    pprint.pprint(parameters)

    grid_search = GridSearchCV(
        pipeline,
        parameters,
        n_jobs=-1,
        verbose=True)
    print("Training and performing grid search...\n")
    t0 = time()
    grid_search.fit(docs_train, y_train)
    print("\nDone in %0.3fs!\n" % (time() - t0))

    # Print the mean and std for each candidate along with the parameter
    # settings for all the candidates explored by grid search.
    n_candidates = len(grid_search.cv_results_['params'])
    for i in range(n_candidates):
        print(i, 'params - %s; mean - %0.2f; std - %0.2f'
              % (grid_search.cv_results_['params'][i],
                 grid_search.cv_results_['mean_test_score'][i],
                 grid_search.cv_results_['std_test_score'][i]))

    # Predict the outcome on the testing set and store it in a variable
    # named y_predicted
    print("\nRunning against testing set...\n")
    y_predicted = grid_search.predict(docs_test)

    # Save model
    print("\nSaving model to", save, "...")
    joblib.dump(grid_search.best_estimator_, save)
    print("Model Saved! \nPrepare for some awesome stats!")
I must admit that I am stumped. After tinkering, searching, and making sure my server is configured correctly, I feel that perhaps someone here can help. Any help is appreciated, and if I need to provide more information, please let me know and I will be happy to.
Also, for reference, I am running:
Answer 0 (score: 1)
I solved this problem, although not perfectly, by removing my custom tokenizer and falling back to one of sklearn's, as sketched below.
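Concretely, the workaround amounts to dropping the tokenizer= argument from the vectorizer, mirroring the line that is already commented out in the training script above. A minimal sketch:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Stop-word list loaded the same way as in the training script.
with open('stopwords.txt', 'r') as f:
    words = [line.strip() for line in f]

# No tokenizer= argument: no function from __main__ gets baked into the pickle,
# so joblib can reload the model inside mod_wsgi without the AttributeError.
pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words=words, min_df=0.001, max_df=0.5,
                             ngram_range=(1, 1))),
    ('clf', LinearSVC(C=5000)),
])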
However, I still don't know how to integrate my own tokenizer.
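For future readers, my understanding (not part of the original answer) is that joblib/pickle stores the custom tokenizer by reference: when the training script is run directly, the function is recorded as __main__.tokenize, which resolves under the Flask development server (where the app module is __main__ and defines tokenize) but not under mod_wsgi, where the app is imported as conspiracydetector and __main__ has no such attribute. One way to keep a custom tokenizer is to move it into its own module that both the training script and the Flask app import; this is only a sketch, and the module name my_tokenizer.py and its location are assumptions:

# my_tokenizer.py -- hypothetical shared module, importable by both the
# training script and the Flask app (e.g. placed in /var/www/mysite).
import re

from nltk import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer


def tokenize(text):
    # Same logic as in the original post: strip punctuation, tokenize, lemmatize.
    text = re.sub(r'\W+', ' ', text)
    return [WordNetLemmatizer().lemmatize(token) for token in word_tokenize(text)]

Both the training script and conspiracydetector.py would then use from my_tokenizer import tokenize instead of defining the function inline, so the pickle records my_tokenizer.tokenize rather than __main__.tokenize. The model would need to be re-trained (re-pickled) after this change, and the module has to be on the path mod_wsgi uses, e.g. the directory added via sys.path.append in the .wsgi file.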