How to serialize a pyspark Pipeline object?

Date: 2016-04-15 14:58:35

Tags: python apache-spark serialization pyspark apache-spark-ml

I'm trying to serialize a PySpark Pipeline object so that it can be saved and retrieved later. I tried the Python pickle library as well as PySpark's PickleSerializer, but the dumps() call itself fails.

Here is the code snippet using the native pickle library:

import pickle

pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
with open('myfile', 'wb') as f:
    pickle.dump(pipeline, f, 2)
with open('myfile', 'rb') as f:
    pipeline1 = pickle.load(f)

Running it produces the following error:

py4j.protocol.Py4JError: An error occurred while calling o32.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:335)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344)
    at py4j.Gateway.invoke(Gateway.java:252)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Thread.java:785)

Is it possible to serialize PySpark Pipeline objects?

1 Answer:

Answer 0: (score: 2)

Technically speaking, you can pickle a Pipeline object without trouble:

from pyspark.ml.pipeline import Pipeline
import pickle

pickle.dumps(Pipeline(stages=[]))
## b'\x80\x03cpyspark.ml.pipeline\nPipeline\nq ...

What you cannot pickle are Spark Transformers and Estimators, which are only thin wrappers around JVM objects. If you really need this, you can wrap the construction in a function, for example:

from pyspark.ml.feature import Tokenizer

def make_pipeline():
    return Pipeline(stages=[Tokenizer(inputCol="text", outputCol="words")])

pickle.dumps(make_pipeline)
## b'\x80\x03c__ ...

But since this is just a piece of code and does not store any persistent data, it doesn't seem particularly useful.
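To see why the factory trick stores no data, the same behavior can be illustrated in plain Python with no Spark at all (`make_greeting` below is a hypothetical stand-in for `make_pipeline`): pickle serializes a module-level function only as a reference to its module and name, not its body or any captured state.

```python
import pickle

# A module-level factory function. Pickling it stores only a
# reference ("this module" + "make_greeting"), not the code inside.
def make_greeting():
    return "hello"

payload = pickle.dumps(make_greeting)
restored = pickle.loads(payload)

# Unpickling just looks the name up again in the module, so we get
# back the very same function object and can call it as usual.
print(restored is make_greeting)  # True
print(restored())                 # hello
```

This is why the pickled factory is only useful in an environment where the same module (and a working Spark session) is already available; nothing about the pipeline itself travels with the bytes.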