We are trying to save a Spark ML Pipeline that contains a custom transformer, using PySpark 2.2.0. Here is a simple custom transformer which, in this case, does nothing:
from pyspark.ml import Transformer, Pipeline
from pyspark.ml.param.shared import HasInputCol, HasOutputCol


class CustomTransformer(Transformer, HasInputCol, HasOutputCol):
    def __init__(self, inputCol=None, outputCol=None):
        super(CustomTransformer, self).__init__()
        # store the column names as params so getInputCol()/getOutputCol() work
        if inputCol is not None:
            self._set(inputCol=inputCol)
        if outputCol is not None:
            self._set(outputCol=outputCol)

    def _transform(self, dataset):
        # ... do something ...
        return dataset
df = spark.createDataFrame([
    (0, "Hi I heard about Spark"),
    (0, "I wish Java could use case classes"),
    (1, "Logistic regression models are neat")
], ["label", "sentence"])

customTransformer = CustomTransformer(inputCol="label", outputCol="sentence_2")
pipeline = Pipeline(stages=[customTransformer])
model = pipeline.fit(df)
model.save("save-pipeline-out-test")  # <--- FAILS
Running this fails with the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../python/pyspark/ml/pipeline.py", line 217, in save
self.write().save(path)
File "/.../python/pyspark/ml/pipeline.py", line 212, in write
return JavaMLWriter(self)
File "/.../python/pyspark/ml/util.py", line 99, in __init__
_java_obj = instance._to_java()
File "/.../python/pyspark/ml/pipeline.py", line 249, in _to_java
java_stages[idx] = stage._to_java()
AttributeError: 'CustomTransformer' object has no attribute '_to_java'
What is the best approach to saving (and later loading) a custom transformer in PySpark?