Add a new fitted stage to an existing PipelineModel without fitting again

Time: 2018-11-12 19:45:33

Tags: apache-spark pipeline apache-spark-ml apache-spark-2.0

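I would like to concatenate several trained pipelines into one. This is similar to "Spark add new fitted stage to a exitsting PipelineModel without fitting again", but the solution given there, shown below, is for PySpark, whereas I am working in Scala.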

> pipe_model_new = PipelineModel(stages=[pipe_model, pipe_model2])
> final_df = pipe_model_new.transform(df1)

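In Apache Spark 2.0, the constructor of "PipelineModel" is marked private, so it cannot be called from outside the Spark packages. In the "Pipeline" class, only the "fit" method creates a "PipelineModel". My attempt and the resulting compiler error: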

val pipelineModel =  new PipelineModel("randomUID", trainedStages)
val df_final_full = pipelineModel.transform(df)
Error:(266, 26) constructor PipelineModel in class PipelineModel cannot be accessed in class Preprocessor
    val pipelineModel =  new PipelineModel("randomUID", trainedStages)

1 Answer:

Answer 0 (score: 1)

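There is nothing* wrong with using Pipeline and calling its fit method. If a stage is a Transformer, and PipelineModel is one**, fit acts like the identity function for that stage. So you can pass the already-fitted models as the stages of a new Pipeline and call fit on it, for example new Pipeline().setStages(trainedStages).fit(df) with the names from the question.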

You can check the relevant Python code:

if isinstance(stage, Transformer):
    transformers.append(stage)
    dataset = stage.transform(dataset)

and the Scala code:

case t: Transformer =>
  t

This means that the fitting process will only validate the schema and create a new PipelineModel object.
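The branch quoted above can be mimicked without Spark to see why nothing is refitted. The following is only a toy sketch in plain Python, loosely modeled on that logic: ToyTransformer, ToyEstimator, ToyPipeline and ToyPipelineModel are hypothetical stand-ins invented for illustration, not Spark classes, and a list of numbers plays the role of a DataFrame.

```python
class ToyTransformer:
    """Hypothetical stand-in for a fitted stage: applies a fixed function."""
    def __init__(self, fn):
        self.fn = fn

    def transform(self, data):
        return [self.fn(x) for x in data]


class ToyEstimator:
    """Hypothetical stand-in for an unfitted stage; counts calls to fit."""
    fit_calls = 0

    def fit(self, data):
        ToyEstimator.fit_calls += 1
        mean = sum(data) / len(data)
        return ToyTransformer(lambda x: x - mean)  # "model" that centers data


class ToyPipelineModel(ToyTransformer):
    """A fitted pipeline is itself a transformer that chains its stages."""
    def __init__(self, stages):
        self.stages = list(stages)

    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data


class ToyPipeline:
    def __init__(self, stages):
        self.stages = list(stages)

    def fit(self, data):
        # Only stages up to the last estimator need training data.
        last_est = -1
        for i, stage in enumerate(self.stages):
            if isinstance(stage, ToyEstimator):
                last_est = i
        fitted = []
        for i, stage in enumerate(self.stages):
            if i <= last_est:
                if isinstance(stage, ToyTransformer):
                    fitted.append(stage)            # kept as-is, no refit
                    data = stage.transform(data)
                else:
                    model = stage.fit(data)         # only estimators are fit
                    fitted.append(model)
                    if i < last_est:
                        data = model.transform(data)
            else:
                fitted.append(stage)                # already a transformer
        return ToyPipelineModel(fitted)


train = [1.0, 2.0, 3.0]
model_a = ToyPipeline([ToyEstimator()]).fit(train)  # trained once, here
model_b = ToyPipelineModel([ToyTransformer(lambda x: x * 10)])

calls_before = ToyEstimator.fit_calls
combined = ToyPipeline([model_a, model_b]).fit(train)  # combine fitted models
refit_calls = ToyEstimator.fit_calls - calls_before
result = combined.transform([4.0])
```

In this sketch the second fit never reaches an estimator, so refit_calls stays at zero and combined holds the very same model_a and model_b objects. The real Pipeline additionally validates the schema, which the toy version leaves out.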

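*The only possible concern is the presence of non-lazy Transformers. However, with the exception of the deprecated OneHotEncoder, the Spark core API does not provide any.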

**In Python:

from pyspark.ml import Transformer, PipelineModel

issubclass(PipelineModel, Transformer)
True 

In Scala:

import scala.reflect.runtime.universe.typeOf
import org.apache.spark.ml._

typeOf[PipelineModel] <:< typeOf[Transformer]
Boolean = true