Saving a custom Transformer in a PySpark ML pipeline

Posted: 2018-10-11 08:46:49

Tags: python apache-spark pyspark apache-spark-ml

We are trying to save a Spark ML pipeline that contains a custom transformer, using PySpark version 2.2.0.

Code

Here is a simple custom transformer which, in this case, does nothing:

from pyspark import keyword_only
from pyspark.ml import Transformer, Pipeline
from pyspark.ml.param.shared import HasInputCol, HasOutputCol

class CustomTransformer(Transformer, HasInputCol, HasOutputCol):
  @keyword_only
  def __init__(self, inputCol=None, outputCol=None):
    super(CustomTransformer, self).__init__()
    # store the constructor arguments as Params (otherwise they are silently dropped)
    self._set(**self._input_kwargs)

  def _transform(self, dataset):
    # ... do something ...
    return dataset


df = spark.createDataFrame([
  (0, "Hi I heard about Spark"),
  (0, "I wish Java could use case classes"),
  (1, "Logistic regression models are neat")
], ["label", "sentence"])

customTransformer = CustomTransformer(inputCol="label", outputCol="sentence_2")
pipeline = Pipeline(stages=[customTransformer])
model = pipeline.fit(df)

model.save("save-pipeline-out-test")  # <--- FAILED

Exception

Running this fails with the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../python/pyspark/ml/pipeline.py", line 217, in save
    self.write().save(path)
  File "/.../python/pyspark/ml/pipeline.py", line 212, in write
    return JavaMLWriter(self)
  File "/.../python/pyspark/ml/util.py", line 99, in __init__
    _java_obj = instance._to_java()
  File "/.../python/pyspark/ml/pipeline.py", line 249, in _to_java
    java_stages[idx] = stage._to_java()
AttributeError: 'CustomTransformer' object has no attribute '_to_java'

Question

What is the best approach to saving a custom transformer in PySpark?

0 Answers:

No answers yet.