Running a model trained with sklearn on a PySpark DataFrame

Time: 2020-05-07 14:27:28

Tags: python pyspark scikit-learn apache-spark-sql pickle

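I am working on a text classification problem. I used a CountVectorizer and TF-IDF, and saved the model and the vectorizer in Python. This is how I saved them: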

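import pickle

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X_train, X_test, y_train, y_test = train_test_split(df_upsampled['conv_cleaned'], df_upsampled['r1'], random_state=0)
# Turn the text into 1- to 3-gram counts, then reweight the counts with TF-IDF
count_vect = CountVectorizer(max_df=.98, min_df=.001, encoding='latin-1', ngram_range=(1, 3))
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer(sublinear_tf=True, norm='l2')
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
clf = LinearSVC().fit(X_train_tfidf, y_train)

# Save the model and the vectorizer to disk
filename = 'Linearsvctrainedcallreason.pkl'
with open(filename, 'wb') as model_file:
    pickle.dump(clf, model_file)
with open("vectorizer.pickle", "wb") as vectorizer_file:
    pickle.dump(count_vect, vectorizer_file)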
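
Note that the code above pickles the classifier and the count vectorizer but not the fitted TfidfTransformer, which would also be needed to reproduce the training features at prediction time. One possible way around this, sketched below, is to chain the three steps in a scikit-learn Pipeline and pickle that single object. The toy texts and labels, the step names, and the build_pipeline helper are assumptions for illustration only; they are not from the question.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def build_pipeline():
    """Chain the three steps so they are fitted and pickled as one object."""
    return Pipeline([
        ("counts", CountVectorizer(ngram_range=(1, 3))),
        ("tfidf", TfidfTransformer(sublinear_tf=True, norm="l2")),
        ("svc", LinearSVC()),
    ])


# Toy data (hypothetical), standing in for df_upsampled['conv_cleaned'] / ['r1'].
texts = ["bill is too high", "high bill this month",
         "card was lost", "lost my card today"]
labels = ["billing", "billing", "card", "card"]

pipeline = build_pipeline().fit(texts, labels)

# One pickle now holds the vocabulary, the IDF weights and the classifier.
payload = pickle.dumps(pipeline)
restored = pickle.loads(payload)

# The restored pipeline takes raw strings and applies every step itself.
predictions = restored.predict(["my bill is high", "I lost the card"])
```

With this layout, only one file has to be shipped to wherever predictions are made, and the caller passes raw text instead of pre-computed features.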

Now I want to use it on a PySpark DataFrame so that it can be used in production. Here is the code I tried, which does not work. Part 1:

@f.udf(returnType=StringType())
def predict_f1(message):
    return pd.Series(loaded_model.predict(vectorizer.transform(message)))
temp_f1 = temp.withColumn('predicted_reason', predict_f1(temp.conv))

temp_f1.take(5)
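
A few things in Part 1 may be worth checking. A plain udf is called once per row, so message arrives as a single Python str, while CountVectorizer.transform expects an iterable of documents and raises a ValueError for a bare string. The declared StringType return type also calls for a plain str rather than a pd.Series. The sketch below shows one row-at-a-time variant under those assumptions. The toy vectorizer and loaded_model stand in for the unpickled objects, and predict_one and make_row_udf are hypothetical helper names.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Toy stand-ins (hypothetical) for the unpickled `vectorizer` and `loaded_model`.
train_texts = ["bill is too high", "high bill this month",
               "card was lost", "lost my card today"]
train_labels = ["billing", "billing", "card", "card"]
vectorizer = CountVectorizer().fit(train_texts)
loaded_model = LinearSVC().fit(vectorizer.transform(train_texts), train_labels)


def predict_one(message):
    """Classify a single string and return a plain Python str (or None)."""
    if message is None:
        return None
    # transform() expects an iterable of documents, so wrap the string in a list.
    features = vectorizer.transform([message])
    return str(loaded_model.predict(features)[0])


def make_row_udf():
    """Wrap predict_one as a row-at-a-time Spark UDF (imports pyspark lazily)."""
    from pyspark.sql import functions as f
    from pyspark.sql.types import StringType
    return f.udf(predict_one, StringType())


# Intended use on the question's DataFrame (needs a running Spark session):
# temp_f1 = temp.withColumn('predicted_reason', make_row_udf()(temp.conv))
```

Calling the model once per row is slow for large tables, so this form is mainly useful for checking that the model and vectorizer deserialize and predict correctly on the executors.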

Part 2:

predict_f2 = f.pandas_udf(lambda message: loaded_model.predict(vectorizer.transform(message)), StringType())
temp_f2 = temp.withColumn('predicted_reason', predict_f2(temp.conv))

temp_f2.take(5)
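
For Part 2, the documented contract of a scalar pandas_udf is that the function receives a pandas.Series and returns a pandas.Series of the same length, whereas loaded_model.predict returns a NumPy array. The sketch below wraps the result accordingly. It is an illustration under stated assumptions, not a confirmed fix: the toy vectorizer and loaded_model replace the unpickled objects, predict_batch and make_pandas_udf are hypothetical names, and replacing nulls with empty strings is only one possible choice.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Toy stand-ins (hypothetical) for the unpickled `vectorizer` and `loaded_model`.
train_texts = ["bill is too high", "high bill this month",
               "card was lost", "lost my card today"]
train_labels = ["billing", "billing", "card", "card"]
vectorizer = CountVectorizer().fit(train_texts)
loaded_model = LinearSVC().fit(vectorizer.transform(train_texts), train_labels)


def predict_batch(messages):
    """Classify a whole pandas Series of strings in one vectorized call."""
    # Nulls would break the vectorizer; treat them as empty documents here.
    features = vectorizer.transform(messages.fillna(""))
    # Return a Series with the same length and index as the input batch.
    return pd.Series(loaded_model.predict(features), index=messages.index)


def make_pandas_udf():
    """Wrap predict_batch as a scalar pandas UDF (imports pyspark lazily)."""
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import StringType
    return pandas_udf(predict_batch, StringType())


# Intended use on the question's DataFrame (needs Spark with pyarrow installed):
# temp_f2 = temp.withColumn('predicted_reason', make_pandas_udf()(temp.conv))
```

Because each call handles a batch of rows, the vectorizer and the classifier run once per batch rather than once per message.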

Error message:

Py4JJavaError: An error occurred while calling o946.collectToPython. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 139.0 failed 4 times, most recent failure: Lost task 0.3 in stage 139.0 (TID 18180, bdswr004x12h5.nam.nsroot.net, executor 2530): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/opt/cloudera/parcels/SPARK2-2.4.0.cloudera2-1.cdh5.13.3.p3544.1321029/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 267, in main ("%d.%d" % sys.version_info[:2], version))
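
The traceback is cut off, so its cause cannot be read from the post. The last visible fragment, ("%d.%d" % sys.version_info[:2], version)), looks like the arguments of the check in PySpark's worker.py that compares the worker's Python major.minor version with the driver's; treating it as such is an assumption. The sketch below only illustrates that kind of comparison. The helper names are hypothetical and the worker version shown is made up, since the real one would have to be read on an executor node.

```python
import sys


def minor_version(version_info=None):
    """Format a version tuple as 'major.minor', e.g. (3, 6, 5) -> '3.6'."""
    if version_info is None:
        version_info = sys.version_info
    return "%d.%d" % tuple(version_info[:2])


def versions_match(driver_version, worker_version):
    """Return True when both sides report the same 'major.minor' string."""
    return driver_version == worker_version


driver_version = minor_version()
# Hypothetical worker value, for illustration only; the real one would have to
# be read on an executor node (for example by running `python --version` there).
worker_version = minor_version((2, 7, 15))

if versions_match(driver_version, worker_version):
    message = "driver and worker agree on Python %s" % driver_version
else:
    message = "driver runs Python %s but worker runs Python %s" % (
        driver_version, worker_version)
```

If the versions do differ, PySpark reads the PYSPARK_PYTHON environment variable to choose the interpreter used by the workers, so pointing it at an interpreter with the same major.minor version as the driver is a common thing to try.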

I have python: 3.6.5, pyspark: 2.3.1, scikit-learn: 0.21.3, pandas: 0.22.0, pyarrow: 0.11.1

0 answers:

No answers yet