I am fairly new to machine learning in general, and I want to store my model in the cloud to do online predictions.
I successfully trained a logistic regression model using a TfidfVectorizer in Scikit-learn (for sentiment analysis), both locally in a Jupyter Notebook and on Google AI Platform using their Training Job feature.
I should mention that I included bs4, nltk and lxml as required PyPI packages in my training package's setup.py file.
My training algorithm is as follows:
Import a CSV file of input strings and their labels (the outputs) as a pandas DataFrame (the model has one input variable, the string).
Preprocess the input strings with bs4 and nltk to remove unnecessary characters and stop words, and to convert all characters to lowercase (to reproduce this, just use strings containing only lowercase letters).
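As a rough illustration of that preprocessing step, here is a minimal stand-in using only the standard library (my actual code uses bs4 and nltk; the tag-stripping regex and the tiny stop-word set below are simplified stand-ins, not the real BeautifulSoup/nltk behaviour):

```python
import re

# tiny stand-in for nltk.corpus.stopwords.words('english')
STOP_WORDS = {"the", "a", "an", "is", "was", "over"}

def preprocess(text):
    # stand-in for BeautifulSoup(text, 'lxml').get_text(): strip HTML tags
    text = re.sub(r"<[^>]+>", " ", text)
    # keep letters and whitespace only, then lowercase everything
    text = re.sub(r"[^a-zA-Z\s]", " ", text).lower()
    # drop stop words
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

print(preprocess("<p>The movie was GREAT!</p>"))  # movie great
```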
Create the pipeline:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

tvec = TfidfVectorizer()
lclf = LogisticRegression(fit_intercept=False, random_state=255, max_iter=1000)
model_1 = Pipeline([('vect', tvec), ('clf', lclf)])
Cross-validate with GridSearchCV and take the best estimator as the model I want. This is the model saved in the model.joblib file for Google:
from sklearn.model_selection import GridSearchCV
param_grid = [{'vect__ngram_range' : [(1, 1)],
'clf__penalty' : ['l1', 'l2'],
'clf__C' : [1.0, 10.0, 100.0]},
{'vect__ngram_range' : [(1, 1)],
'clf__penalty' : ['l1', 'l2'],
'clf__C' : [1.0, 10.0, 100.0],
'vect__use_idf' : [False],
'vect__norm' : [False]}]
gs_lr_tfidf = GridSearchCV(model_1, param_grid, scoring='accuracy',
cv=5, verbose=1, n_jobs=-1)
gs_lr_tfidf.fit(X_train, y_train)
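For reference, exporting the best estimator to the model.joblib file mentioned above can be sketched like this (assuming scikit-learn and joblib are installed; the two-sample training data here is made up for illustration, and in my code the fitted pipeline is gs_lr_tfidf.best_estimator_):

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# stand-in pipeline; in the question this is gs_lr_tfidf.best_estimator_
pipe = Pipeline([('vect', TfidfVectorizer()),
                 ('clf', LogisticRegression(max_iter=1000))])
pipe.fit(["good movie", "bad movie"], ["good", "bad"])

# AI Platform looks for a file named exactly model.joblib
joblib.dump(pipe, "model.joblib")

# sanity check: reload and predict locally
restored = joblib.load("model.joblib")
print(restored.predict(["good movie"]))
```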
I can output a simple prediction in the Jupyter Notebook file using:
clf = gs_lr_tfidf.best_estimator_
predicted = clf.predict(["INPUT STRING"])
print(predicted)
which prints the predicted label for my input string, e.g. ['good'] or ['bad'].
But although the model trained successfully and was submitted to AI Platform, when I try to request predictions with instances such as (in what I believe is the required JSON format):
["the quick brown fox jumps over the lazy dog"]
["hi what is up"]
the shell returns an error.
What could be going wrong here?
Could dependencies be the problem: do I also have to install the bs4, lxml and nltk packages for my Google-hosted model?
Or is my input JSON format incorrect?
Thanks for any help.
Answer (score: 0)
OK, I found that the JSON format was indeed malformed. (Answered at https://stackoverflow.com/a/51693619/10570541.)
As in the official documentation, the JSON format uses newlines and square brackets to separate instances, for example:
[6.8, 2.8, 4.8, 1.4]
[6.0, 3.4, 4.5, 1.6]
That applies if you have multiple input variables.
With only one input variable, just separate the instances with newlines:
"the quick brown fox jumps over the lazy dog"
"alright it works"
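A quick way to generate such a request file from Python (a sketch; the file name input.json is my own choice, and using json.dumps guarantees each line is a valid JSON value):

```python
import json

# one instance per line; each line must be a valid JSON value on its own
instances = ["the quick brown fox jumps over the lazy dog", "alright it works"]
with open("input.json", "w") as f:
    for text in instances:
        f.write(json.dumps(text) + "\n")

# read it back to show the resulting line format
with open("input.json") as f:
    lines = f.read().splitlines()
print(lines)
```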