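Suppose my feature engineering has many steps, so my pipeline contains many transformers. I would like to know how Spark handles these transformers during cross-validation of the pipeline: are they executed for every fold? Would it be faster to apply the transformers before cross-validating the model?
Which of the following workflows is fastest (or is there a better solution)?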
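1. All stages inside the cross-validated pipeline:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import CrossValidator

transformer1 = ...
transformer2 = ...
transformer3 = ...
lr = LogisticRegression(...)
pipeline = Pipeline(stages=[transformer1, transformer2, transformer3, lr])
crossval = CrossValidator(estimator=pipeline, numFolds=10, ...)
cvModel = crossval.fit(training)
prediction = cvModel.transform(test)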
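2. Apply the transformers first, then cross-validate only the model:
transformer1 = ...
transformer2 = ...
transformer3 = ...
pipeline = Pipeline(stages=[transformer1, transformer2, transformer3])
pipeline_model = pipeline.fit(training)
training_trans = pipeline_model.transform(training)
lr = LogisticRegression(...)
crossval = CrossValidator(estimator=lr, numFolds=10, ...)
cvModel = crossval.fit(training_trans)
# test must pass through the same fitted transformers before prediction
prediction = cvModel.transform(pipeline_model.transform(test))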
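Finally, I have the same question about caching: in 2., I can cache training_trans before running the cross-validation. In 1., I can put a Cacher transformer in the pipeline just before LogisticRegression. (For Cacher, see "Caching intermediate results in Spark ML pipeline".)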
Answer 0 (score: 0)
I ran an experiment, but I am still interested in a more detailed answer if anyone can give one.
%%time
pipeline1 = Pipeline(stages=stringIndexers+oneHotEncoders+[vectorAssembler])
train2 = pipeline1.fit(train).transform(train)
crossval = CrossValidator(estimator=logisticRegression, ...)
crossval.fit(train2)
CPU times: user 508 ms, sys: 136 ms, total: 644 ms
Wall time: 2min 2s
%%time
pipeline1 = Pipeline(stages=stringIndexers+oneHotEncoders+[vectorAssembler])
train2 = pipeline1.fit(train).transform(train)
train2.cache().count()
crossval = CrossValidator(estimator=logisticRegression, ...)
crossval.fit(train2)
CPU times: user 560 ms, sys: 104 ms, total: 664 ms
Wall time: 1min 25s
%%time
pipeline2 = Pipeline(stages=stringIndexers+oneHotEncoders+[vectorAssembler, logisticRegression])
crossval = CrossValidator(estimator=pipeline2, ...)
crossval.fit(train)
CPU times: user 2.06 s, sys: 504 ms, total: 2.56 s
Wall time: 3min
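One rough way to read these timings is to count how often the feature stages get fitted. CrossValidator fits its estimator once per fold and per parameter map, and then refits the best parameter map on the full dataset, so anything inside the cross-validated pipeline is redone each time. The sketch below is plain Python rather than Spark, and `count_feature_fits` with its parameters is a hypothetical helper used only for illustration:

```python
def count_feature_fits(num_folds, num_param_maps, inside_cv):
    """Count how many times each feature estimator stage is fitted.

    inside_cv=True  -> feature stages are part of the cross-validated pipeline
    inside_cv=False -> feature stages are fitted once, before cross-validation
    """
    if not inside_cv:
        # pipeline1.fit(train) runs exactly once, before CrossValidator
        return 1
    # One fit per (fold, parameter map) pair, plus the final refit of the
    # best parameter map on the full training set.
    return num_folds * num_param_maps + 1


if __name__ == "__main__":
    # Hypothetical setup: 10 folds and a grid of 3 parameter maps
    print(count_feature_fits(10, 3, inside_cv=True))
    print(count_feature_fits(10, 3, inside_cv=False))
```

This count applies to estimator stages such as StringIndexer. Plain transformers such as VectorAssembler are not fitted, but their output is still recomputed for each fit unless the intermediate DataFrame is cached. The count says nothing about the cost of each fit, so it only explains the direction of the difference, not its size.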
Answer 1 (score: 0)
According to a spark.ml training I attended recently, the following approach is recommended:
Hope this helps!