Is cross-validation faster without using a Pipeline in spark-ml?

Asked: 2018-07-24 09:53:04

Tags: pyspark pipeline cross-validation apache-spark-ml

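Suppose my feature engineering has many steps, so my pipeline contains many transformers. I would like to know how Spark handles these transformers while cross-validating the pipeline: are they executed for every fold? Would it be faster to apply the transformers before cross-validating the model?

Which of these workflows is fastest (or is there a better solution)?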

1. CrossValidator on the pipeline

transformer1 = ...
transformer2 = ...
transformer3 = ...
lr = LogisticRegression(...)
pipeline = Pipeline(stages=[transformer1, transformer2, transformer3, lr])
crossval = CrossValidator(estimator=pipeline, numFolds=10, ...)

cvModel = crossval.fit(training)
prediction = cvModel.transform(test)

2. CrossValidator after the pipeline

transformer1 = ...
transformer2 = ...
transformer3 = ...
pipeline = Pipeline(stages=[transformer1, transformer2, transformer3])
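pipeline_model = pipeline.fit(training)
training_trans = pipeline_model.transform(training)

lr = LogisticRegression(...)
crossval = CrossValidator(estimator=lr, numFolds=10, ...)

cvModel = crossval.fit(training_trans)
# The test set must pass through the same fitted feature pipeline before scoring
prediction = cvModel.transform(pipeline_model.transform(test))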

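Finally, I have the same question about caching: in 2., I can cache training_trans before running the cross-validation. In 1., I can put a Cacher transformer in the pipeline just before the LogisticRegression. (For Cacher, see "Caching intermediate results in Spark ML pipeline".)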

2 Answers:

Answer 0 (score: 0)

I ran an experiment, but I would still be interested if someone can give a more detailed answer.

%%time
pipeline1 = Pipeline(stages=stringIndexers+oneHotEncoders+[vectorAssembler])
train2 = pipeline1.fit(train).transform(train)
crossval = CrossValidator(estimator=logisticRegression, ...)
crossval.fit(train2)
  

CPU times: user 508 ms, sys: 136 ms, total: 644 ms / Wall time: 2min 2s

%%time
pipeline1 = Pipeline(stages=stringIndexers+oneHotEncoders+[vectorAssembler])
train2 = pipeline1.fit(train).transform(train)
train2.cache().count()
crossval = CrossValidator(estimator=logisticRegression, ...)
crossval.fit(train2)
  

CPU times: user 560 ms, sys: 104 ms, total: 664 ms / Wall time: 1min 25s

%%time
pipeline2 = Pipeline(stages=stringIndexers+oneHotEncoders+[vectorAssembler, logisticRegression])
crossval = CrossValidator(estimator=pipeline2, ...)
crossval.fit(train)
  

CPU times: user 2.06 s, sys: 504 ms, total: 2.56 s / Wall time: 3min
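One way to reason about these timings is to count how often each stage gets fitted. The sketch below is plain Python, not Spark code, and rests on a simplified model: CrossValidator fits its estimator once per (fold, parameter map) pair and then refits the best parameter map once on the full training set, and a Pipeline used as the estimator is refitted as a whole each time. The function name count_fits and its arguments are made up for illustration, and the model ignores Spark's lazy evaluation and any caching.

```python
def count_fits(num_folds, num_param_maps, num_feature_stages, pipeline_inside_cv):
    """Count stage fits under a simplified model of CrossValidator.

    Assumption: the estimator is fitted once per (fold, param map) pair,
    plus one final refit of the best param map on the full training set.
    """
    estimator_fits = num_folds * num_param_maps + 1

    if pipeline_inside_cv:
        # Workflow 1: the feature stages belong to the estimator,
        # so they are rerun every time the pipeline is fitted.
        feature_stage_runs = estimator_fits * num_feature_stages
    else:
        # Workflow 2: the feature stages are fitted once, before the CV.
        feature_stage_runs = num_feature_stages

    return {"lr_fits": estimator_fits, "feature_stage_runs": feature_stage_runs}


# Hypothetical setup: 10 folds, a single param map, 3 feature stages.
workflow_1 = count_fits(10, 1, 3, pipeline_inside_cv=True)
workflow_2 = count_fits(10, 1, 3, pipeline_inside_cv=False)

print("pipeline inside CV:", workflow_1)
print("pipeline before CV:", workflow_2)
```

Under these assumptions the logistic regression is fitted the same number of times in both workflows, and only the feature stages differ, which points in the same direction as the roughly 3 min versus 2 min measured above. Note that in the second workflow any feature stage that learns from data (for example a StringIndexer) also sees the validation rows of every fold, so the two workflows are not strictly equivalent.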

Answer 1 (score: 0)

According to a spark.ml training I attended recently, the following approach is recommended:


Hope this helps!