I am doing machine learning with the Spark ML algorithms (org.apache.spark.ml.regression). Below is my code; it runs forever and does not finish even after an hour:
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler, OneHotEncoder}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.regression.{GBTRegressor, GBTRegressionModel}

val df_train = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(data_path_train)
  .select(all_toUse_name.head, all_toUse_name.tail: _*)
  .withColumn(target_col_name, col(target_col_name).cast(DoubleType))
val index_transformers: Array[org.apache.spark.ml.PipelineStage] = char_col_toUse_names.map(
cname => new StringIndexer()
.setInputCol(cname)
.setOutputCol("int_"+cname)
)
val one_hot_encoders: Array[org.apache.spark.ml.PipelineStage] = char_col_toUse_names.map(
cname => new OneHotEncoder()
.setInputCol("int_"+cname)
.setOutputCol("one_hot_"+cname)
)
val one_hot_col_names = char_col_toUse_names.map("one_hot_" + _)
val col_name = one_hot_col_names ++ num_col_toUse_names
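Since the question's column variables are not shown, here is a minimal plain-Scala sketch (with made-up column names standing in for char_col_toUse_names) of what the renaming step above computes:

```scala
// Hypothetical stand-ins for char_col_toUse_names.
val charCols = Array("city", "device")

// Prefix every categorical column name, matching the
// "one_hot_" output columns produced by the OneHotEncoder stages.
val oneHotCols = charCols.map("one_hot_" + _)

println(oneHotCols.mkString(", "))  // prints: one_hot_city, one_hot_device
```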
val assembler = new VectorAssembler().setInputCols(col_name).setOutputCol("features")
val gbt_model = new GBTRegressor().setLabelCol(target_col_name).setFeaturesCol("features").setMaxIter(2).setMaxDepth(2)
val stages: Array[org.apache.spark.ml.PipelineStage] = index_transformers ++ one_hot_encoders :+ assembler :+ gbt_model
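The stage order matters here, because a Pipeline runs its stages left to right. A small plain-Scala sketch (with strings standing in for the actual PipelineStage objects; the names are made up) shows how ++ and :+ compose the array:

```scala
// Strings stand in for the real indexer/encoder stage objects.
val indexers = Array("indexer_a", "indexer_b")
val encoders = Array("encoder_a", "encoder_b")

// ++ and :+ share precedence and associate left, so this parses as
// ((indexers ++ encoders) :+ "assembler") :+ "gbt"
val stages = indexers ++ encoders :+ "assembler" :+ "gbt"

println(stages.mkString(" -> "))
// prints: indexer_a -> indexer_b -> encoder_a -> encoder_b -> assembler -> gbt
```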
val pipeline = new Pipeline().setStages(stages)
val trained_model = pipeline.fit(df_train)
When I used MLlib (with the model trained outside the pipeline), I hit the same problem, but it was solved by adding sc.parallelize (see the last line below). The same data now takes less than 5 minutes with org.apache.spark.mllib.tree.GradientBoostedTrees.
Note: the following code is probably wrong in several places, but it finishes when we add sc.parallelize, and it too hangs forever when we leave sc.parallelize out.
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.col
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler, OneHotEncoder}
import org.apache.spark.ml.Pipeline
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy

val sqlContext = new SQLContext(sc)
val df_train = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(data_path_train)
  .select(all_toUse_name.head, all_toUse_name.tail: _*)
val index_transformers: Array[org.apache.spark.ml.PipelineStage] = char_col_toUse_names.map(
cname => new StringIndexer()
.setInputCol(cname)
.setOutputCol("int_"+cname)
)
val one_hot_encoders: Array[org.apache.spark.ml.PipelineStage] = char_col_toUse_names.map(
cname => new OneHotEncoder()
.setInputCol("int_"+cname)
.setOutputCol("one_hot_"+cname)
)
val one_hot_col_names = char_col_toUse_names.map("one_hot_" + _)
val col_name = one_hot_col_names ++ num_col_toUse_names
val assembler = new VectorAssembler().setInputCols(col_name).setOutputCol("features")
val stages: Array[org.apache.spark.ml.PipelineStage] = index_transformers ++ one_hot_encoders :+ assembler
val pipeline = new Pipeline().setStages(stages)
val indexed_df_train = pipeline.fit(df_train).transform(df_train)
val trainingData = indexed_df_train.select(col(target_col_name).cast("double").alias("label"), col("features")).map(row => LabeledPoint(row.getDouble(0), row(1).asInstanceOf[Vector]))
var boostingStrategy = BoostingStrategy.defaultParams("Regression")
boostingStrategy.numIterations = 10 // Note: Use more iterations in practice.
boostingStrategy.treeStrategy.maxDepth = 10
boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int, Int]()
val model = GradientBoostedTrees.train(sc.parallelize(trainingData.collect()), boostingStrategy)
How can we solve this? Also, regarding the last approach: one can argue that sc.parallelize(trainingData.collect()) does redundant work, collecting the distributed RDD to the driver only to redistribute it as a fresh RDD, yet it is exactly what makes the computation fast.
My guess (which may be completely wrong) is that the ML algorithms operate on DataFrames rather than RDDs, and since sc.parallelize yields an RDD, the ML algorithms can never be combined with sc.parallelize. If that is the case, how do we make the code finish quickly when using the ML algorithms (rather than MLlib)?
Regards