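I have written code for a RandomForest model in Spark (Scala). My data has 200 million records and 12 features. Training a plain RandomForest works fine, but hyperparameter tuning with ParamGridBuilder and CrossValidator takes forever, even on sampled data and with only one value per parameter in the ParamGridBuilder. Any idea why?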
I am running the code on an EMR cluster of C5.2xlarge instances: 1 master node, 4 core nodes, and 5 task nodes. I tried this code:
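import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// The label indexer is fitted up front so its labels can be reused by labelConverter below
val stringIndexer_label = new StringIndexer().setInputCol("lbl").setOutputCol("label").fit(df_data)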
val playerIndexer = new StringIndexer().setInputCol("player").setOutputCol("player_index").setHandleInvalid("keep")
val domain_bundleIndexer = new StringIndexer().setInputCol("domain_bundle").setOutputCol("domain_bundle_index").setHandleInvalid("keep")
val cityIndexer = new StringIndexer().setInputCol("city").setOutputCol("city_index").setHandleInvalid("keep")
val regionIndexer = new StringIndexer().setInputCol("region").setOutputCol("region_index").setHandleInvalid("keep")
// val categoricalColumns = Array("device_id", "player", "bundle", "city", "region")

// Assemble the indexed categorical columns and the remaining columns into one feature vector
val vectorAssembler_features = new VectorAssembler().
  setInputCols(Array("player_index", "city_index", "region_index", "domain_bundle_index", "uid_type", "advertiser_id", "dayweek", "hour", "video_duration", "exchange_id", "device_type_id", "ip", "user_agent_hash")).
  setOutputCol("features")

// Random forest
val rf = new RandomForestClassifier().setLabelCol("label").setFeaturesCol("features").setMaxBins(100000000).setNumTrees(101).setMaxDepth(8)
val labelConverter = new IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(stringIndexer_label.labels)
val pipelineRF = new Pipeline().setStages(Array(stringIndexer_label,domain_bundleIndexer, playerIndexer, cityIndexer, regionIndexer,vectorAssembler_features, rf))
//
val paramGridRF = new ParamGridBuilder().
  // addGrid(rf.maxBins, Array(100, 200)).
  addGrid(rf.maxDepth, Array(4, 8, 10)).
  addGrid(rf.numTrees, Array(11, 51, 101)).
  addGrid(rf.impurity, Array("entropy", "gini")).
  build()
// val paramGridRF = new ParamGridBuilder().
// // addGrid(rf.maxBins, Array(100)).
// addGrid(rf.maxDepth, Array(10)).
// addGrid(rf.numTrees, Array( 11)).
// addGrid(rf.impurity, Array("entropy")).
// build()
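// Score on the classifier's default "rawPrediction" column;
// the hard 0/1 "prediction" column would give a much coarser AUC
val evaluatorRF = new BinaryClassificationEvaluator().
  setLabelCol("label").
  setRawPredictionCol("rawPrediction")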
val crossvalRF = new CrossValidator().
  setEstimator(pipelineRF).
  setEvaluator(evaluatorRF).
  setEstimatorParamMaps(paramGridRF).
  // setNumFolds(3).  (left at the default, which is 3 folds)
  setCollectSubModels(true) // keeps every fitted sub-model, available as model.subModels
val pipelineModelRF = crossvalRF.fit(training_data)
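Since the simple model without cross-validation takes about 10 minutes to run, I expected the code above (with the commented-out single-value paramGrid) to take roughly 3 * 10 minutes = 30 minutes. But it never finishes. Can anyone give me a suggestion?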
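
As a rough check of the estimate above, here is a minimal sketch of how many times CrossValidator fits the whole pipeline. It assumes Spark's documented behaviour: one fit per fold and parameter combination, plus one final refit of the best combination on the full training set. The names `CvCostSketch`, `totalFits`, and `gridSize` are hypothetical helpers, not part of Spark. Time per fit is deliberately not modeled, because each fold trains on only part of the data and the grid values differ from those of the 10-minute run.

```scala
// Illustrative cost model only; these helpers are not Spark API.
object CvCostSketch {
  // One fit per (fold, param map) pair, plus a final refit of the best
  // param map on the full training set.
  def totalFits(numFolds: Int, gridSize: Int): Int =
    numFolds * gridSize + 1

  // A full grid contains the product of the value counts of its parameters.
  def gridSize(valuesPerParam: Seq[Int]): Int =
    valuesPerParam.product

  def main(args: Array[String]): Unit = {
    val folds = 3 // default numFolds, since setNumFolds is commented out

    // Commented-out grid: one value each for maxDepth, numTrees, impurity
    val small = gridSize(Seq(1, 1, 1))
    println(s"single-value grid: ${totalFits(folds, small)} pipeline fits") // 4

    // Active grid: 3 maxDepth values, 3 numTrees values, 2 impurity values
    val full = gridSize(Seq(3, 3, 2))
    println(s"full grid: ${totalFits(folds, full)} pipeline fits") // 55
  }
}
```

Under these assumptions even the single-value grid means 4 fits rather than 3, and because the estimator is the whole pipeline, every fit also re-runs the StringIndexer stages, not just the random forest.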