CrossValidation for a random forest takes forever on an EMR cluster

Time: 2019-06-03 15:52:04

Tags: scala apache-spark random-forest amazon-emr cross-validation

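I have written code in Spark Scala for a RandomForest model. My data has 200 million records and 12 features. It works fine for a plain RandomForest, but hyperparameter tuning with ParamGridBuilder and CrossValidator takes forever, even on sampled data and with a single value for each parameter in the ParamGridBuilder. Any idea why?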

I am running the code on an EMR cluster of C5.2xlarge instances: 1 master node, 4 core nodes, and 5 task nodes. I tried this code:

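import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// The label indexer is fitted up front so its labels can be reused by labelConverter.
val stringIndexer_label = new StringIndexer().setInputCol("lbl").setOutputCol("label").fit(df_data)

// Indexers for the categorical string columns; unseen values are kept in an extra bucket.
val playerIndexer = new StringIndexer().setInputCol("player").setOutputCol("player_index").setHandleInvalid("keep")
val domain_bundleIndexer = new StringIndexer().setInputCol("domain_bundle").setOutputCol("domain_bundle_index").setHandleInvalid("keep")
val cityIndexer = new StringIndexer().setInputCol("city").setOutputCol("city_index").setHandleInvalid("keep")
val regionIndexer = new StringIndexer().setInputCol("region").setOutputCol("region_index").setHandleInvalid("keep")

// val categoricalColumns = Array("device_id", "player", "bundle", "city", "region")

val vectorAssembler_features = new VectorAssembler().
  setInputCols(Array("player_index", "city_index", "region_index", "domain_bundle_index", "uid_type", "advertiser_id", "dayweek", "hour", "video_duration", "exchange_id", "device_type_id", "ip", "user_agent_hash")).
  setOutputCol("features")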


////////////////////////////////
/////// RANDOM FOREST

val rf = new RandomForestClassifier().setLabelCol("label").setFeaturesCol("features").setMaxBins(100000000).setNumTrees(101).setMaxDepth(8)

// Maps predicted indices back to the original labels (not added to the pipeline below).
val labelConverter = new IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(stringIndexer_label.labels)

val pipelineRF = new Pipeline().setStages(Array(stringIndexer_label, domain_bundleIndexer, playerIndexer, cityIndexer, regionIndexer, vectorAssembler_features, rf))

// Full grid: 3 x 3 x 2 = 18 parameter combinations.
val paramGridRF = new ParamGridBuilder().
  // addGrid(rf.maxBins, Array(100, 200)).
  addGrid(rf.maxDepth, Array(4, 8, 10)).
  addGrid(rf.numTrees, Array(11, 51, 101)).
  addGrid(rf.impurity, Array("entropy", "gini")).
  build()

// Single-value grid (one combination), also tried:
// val paramGridRF = new ParamGridBuilder().
//   // addGrid(rf.maxBins, Array(100)).
//   addGrid(rf.maxDepth, Array(10)).
//   addGrid(rf.numTrees, Array(11)).
//   addGrid(rf.impurity, Array("entropy")).
//   build()


val evaluatorRF = new BinaryClassificationEvaluator().
  setLabelCol("label").
  setRawPredictionCol("prediction")

// numFolds is left at its default of 3 because setNumFolds is commented out.
val crossvalRF = new CrossValidator().
  setEstimator(pipelineRF).
  setEvaluator(evaluatorRF).
  setEstimatorParamMaps(paramGridRF).
  // setNumFolds(3).
  setCollectSubModels(true)

// **** model.subModels

val pipelineModelRF = crossvalRF.fit(training_data)
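For comparison, here is a minimal sketch of a lighter CrossValidator configuration, assuming Spark 2.3 or later (where `setParallelism` and `setCollectSubModels` are available). The helper name `CheaperCrossValidation.build` and the parallelism value of 2 are assumptions made for illustration, not part of the code above, and whether these settings actually shorten the run on this cluster is untested.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.Evaluator
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.tuning.CrossValidator

object CheaperCrossValidation {
  // Hypothetical helper: wraps an existing pipeline, evaluator and grid
  // in a CrossValidator with more conservative settings.
  def build(pipeline: Pipeline, evaluator: Evaluator, grid: Array[ParamMap]): CrossValidator =
    new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(evaluator)
      .setEstimatorParamMaps(grid)
      .setNumFolds(3)             // the default, made explicit
      .setParallelism(2)          // assumed value: evaluate up to 2 parameter settings concurrently
      .setCollectSubModels(false) // do not keep every per-fold model on the driver
}
```

With the names from the question, this would be used as `CheaperCrossValidation.build(pipelineRF, evaluatorRF, paramGridRF).fit(training_data)`. Calling `training_data.cache()` beforehand may also help, since the folds are derived from the same input repeatedly.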

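Since the simple model without cross-validation takes about 10 minutes to run, I expected the code above (with the commented-out, single-value paramGrid) to take about 3 folds × 10 minutes = 30 minutes. But it never finishes. Can anyone give me some advice?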
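The estimate above can be checked with a small calculation. This is a sketch that assumes CrossValidator performs one pipeline fit per fold and parameter combination, plus one final refit of the best combination on the full training set; the object and method names (`FitCount`, `gridSize`, `totalFits`) are made up for illustration.

```scala
// Counts how many pipeline fits a k-fold grid search performs.
// Assumption: one fit per (fold, parameter combination), plus one
// final refit of the best combination on the full training set.
object FitCount {
  // Number of combinations in a grid, given the number of values per parameter.
  def gridSize(valuesPerParam: Seq[Int]): Int = valuesPerParam.product

  def totalFits(valuesPerParam: Seq[Int], numFolds: Int): Int =
    gridSize(valuesPerParam) * numFolds + 1

  def main(args: Array[String]): Unit = {
    val numFolds = 3                  // CrossValidator default when setNumFolds is not called
    val singleValueGrid = Seq(1, 1, 1) // maxDepth, numTrees, impurity: one value each
    val fullGrid = Seq(3, 3, 2)        // the uncommented grid in the code above
    println(s"single-value grid: ${totalFits(singleValueGrid, numFolds)} fits")
    println(s"full grid: ${totalFits(fullGrid, numFolds)} fits")
  }
}
```

Under that assumption the single-value grid needs 4 fits rather than 3, and the full grid needs 55. Because the estimator is the whole pipeline, each fit also refits the four unfitted StringIndexer stages, so 10 minutes per fit is only a rough lower bound.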

0 answers:

No answers yet