XGBoost-4j by DMLC on Spark-1.6.1

Date: 2016-04-21 12:50:20

Tags: scala apache-spark prediction xgboost

I am trying to use DMLC's XGBoost implementation on Spark-1.6.1. I am able to train my data with XGBoost, but I am having difficulty with prediction. What I actually want is to do prediction the way it is done in the Apache Spark MLlib library, which helps with calculating the training error, precision, recall, specificity, and so on.
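
For reference, this is roughly the MLlib-style evaluation I am after, given an RDD of (label, prediction) pairs. This is just a sketch: labelAndPreds stands for the pair RDD that I do not yet know how to produce with XGBoost.

import org.apache.spark.mllib.evaluation.MulticlassMetrics

// labelAndPreds: RDD[(Double, Double)] of (label, prediction) pairs (assumed to exist)
val trainErr = labelAndPreds.filter { case (label, pred) => label != pred }.count.toDouble / labelAndPreds.count
// MulticlassMetrics expects (prediction, label) pairs
val metrics = new MulticlassMetrics(labelAndPreds.map { case (label, pred) => (pred, label) })
println(s"Training error = $trainErr")
println(s"Confusion matrix:\n${metrics.confusionMatrix}")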

I am posting my code below, along with the error I am getting. I launched spark-shell with xgboost4j-spark-0.5-jar-with-dependencies.jar.

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.SparkContext._
import ml.dmlc.xgboost4j.scala.{Booster, DMatrix}
import ml.dmlc.xgboost4j.scala.spark.{DataUtils, XGBoost}
import org.apache.spark.{SparkConf, SparkContext}

// Load and parse the data files.
val data = sc.textFile("file:///home/partha/credit_approval_2_attr.csv")
val data1 = sc.textFile("file:///home/partha/credit_app_fea.csv")


val parsedData = data.map { line =>
    val parts = line.split(',').map(_.toDouble)
    LabeledPoint(parts(0), Vectors.dense(parts.tail))
}.cache()

val parsedData1 = data1.map { line =>
    val parts = line.split(',').map(_.toDouble)
    Vectors.dense(parts)
}



// Tuning parameters
val paramMap = List(
  "eta" -> 0.1f,
  "max_depth" -> 5,
  "num_class" -> 2,
  "objective" -> "multi:softmax",
  "colsample_bytree" -> 0.8,
  "alpha" -> 1,
  "subsample" -> 0.5).toMap

// Train the model
val numRound = 20
val model = XGBoost.train(parsedData, paramMap, numRound, nWorkers = 1)
val pred = model.predict(parsedData1)
pred.collect()

Output of pred:

res0: Array[Array[Array[Float]]] = Array(Array(Array(0.0), Array(1.0), Array(1.0), Array(1.0), Array(0.0), Array(0.0), Array(1.0), Array(1.0), Array(0.0), Array(1.0), Array(0.0), Array(0.0), Array(0.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(0.0), Array(1.0), Array(1.0), Array(0.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(0.0), Array(1.0), Array(1.0), Array(1.0), Array(0.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(0.0), Array(0.0), Array(0.0), Array(0.0), Array(1.0), Array(0.0), Array(0.0), Array(0.0), Array(0.0), Array(0.0), Array(0.0), Array(1.0), Array(1.0), Array(1.0), Array(...

Now I am using:

val labelAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}

Output:

<console>:66: error: overloaded method value predict with alternatives:
  (testSet: ml.dmlc.xgboost4j.scala.DMatrix)Array[Array[Float]] <and>
  (testSet: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector])org.apache.spark.rdd.RDD[Array[Array[Float]]]
 cannot be applied to (org.apache.spark.mllib.linalg.Vector)
                  val prediction = model.predict(point.features)
                                     ^

Then I tried this, since predict() expects an RDD[Vector]:

val labelAndPreds1 = parsedData.map { point =>
  val prediction = model.predict(Vectors.dense(point.features))
  (point.label, prediction)
}

Which resulted in:

<console>:66: error: overloaded method value dense with alternatives:
  (values: Array[Double])org.apache.spark.mllib.linalg.Vector <and>
  (firstValue: Double,otherValues: Double*)org.apache.spark.mllib.linalg.Vector
 cannot be applied to (org.apache.spark.mllib.linalg.Vector)
                  val prediction = model.predict(Vectors.dense(point.features))
                                                         ^

Clearly, the issue I am trying to work around is the RDD types involved, which is easy to handle with GBTs on Spark (http://spark.apache.org/docs/latest/mllib-ensembles.html#gradient-boosted-trees-gbts).

Am I going about this the right way?

Any help or suggestion would be great.

2 Answers:

Answer 0 (score: 3):

Actually, this is not available in the XGBoost algorithm out of the box. I faced the same problem and implemented the following method:

import org.apache.spark.rdd.RDD
import ml.dmlc.xgboost4j.{LabeledPoint => XGBLabeledPoint} // assuming xgboost4j's own LabeledPoint type
import ml.dmlc.xgboost4j.scala.DMatrix
import ml.dmlc.xgboost4j.scala.spark.{DataUtils, XGBoostModel} // thanks to @Z Simon

def labelPredict(testSet: RDD[XGBLabeledPoint],
                 useExternalCache: Boolean = false,
                 booster: XGBoostModel): RDD[(Float, Float)] = {
  val broadcastBooster = testSet.sparkContext.broadcast(booster)
  testSet.mapPartitions { testData =>
    // Duplicate the iterator: one copy builds the DMatrix for prediction,
    // the other keeps the labels to zip with the predicted values.
    val (auxiliaryIterator, testDataIterator) = testData.duplicate
    val testDataArray = auxiliaryIterator.toArray
    val prediction = broadcastBooster.value.predict(new DMatrix(testDataIterator)).flatten
    testDataArray
      .zip(prediction)
      .map { case (labeledPoint, predictionValue) =>
        (labeledPoint.label, predictionValue)
      }.toIterator
  }
}

This is almost the same as what XGBoost does internally, except that it carries the LabeledPoint labels through to the returned predictions. When you pass an RDD of LabeledPoints to this method, it returns an RDD of (label, prediction) tuples, one per value.
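
Once you have those pairs, the metrics from the question follow directly. A minimal usage sketch, assuming xgbTestSet is an RDD[XGBLabeledPoint] already prepared for the test data and model is the trained XGBoostModel:

import org.apache.spark.mllib.evaluation.MulticlassMetrics

// xgbTestSet: RDD[XGBLabeledPoint] (assumed prepared); model: the trained XGBoostModel
val labelAndPreds = labelPredict(xgbTestSet, booster = model)
// MulticlassMetrics expects (prediction, label) pairs as Doubles
val metrics = new MulticlassMetrics(
  labelAndPreds.map { case (label, pred) => (pred.toDouble, label.toDouble) })
println(s"Overall precision = ${metrics.precision}")
println(s"Recall of class 1.0 = ${metrics.recall(1.0)}")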

Answer 1 (score: 1):

If you read the source code of predict():

def predict(testSet: RDD[Vector]): RDD[Array[Array[Float]]] = {
    import DataUtils._
    val broadcastBooster = testSet.sparkContext.broadcast(_booster)
    testSet.mapPartitions { testSamples =>
      if (testSamples.hasNext) {
        val dMatrix = new DMatrix(new JDMatrix(testSamples, null))
        Iterator(broadcastBooster.value.predict(dMatrix))
      } else {
        Iterator()
      }
    }
  }

you will see that testSet.mapPartitions() is applied to your test data, so the result is an array of arrays: one outer array per partition, whose inner arrays hold the predicted values for the test rows. You should do a flatMap on the result.
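
To connect this back to the question, a hedged sketch of getting (label, prediction) pairs: flatten the per-partition arrays, then zip with the labels. This relies on zip requiring identical partition layouts on both sides, which should hold here because predict() is a mapPartitions over an RDD derived from parsedData itself, assuming nothing repartitions in between; with multi:softmax each inner array is assumed to hold a single predicted class.

// Flatten the per-partition Array[Array[Float]] into one prediction per row,
// then zip with the labels from the same source RDD.
val flatPreds = model.predict(parsedData.map(_.features))
  .flatMap(partitionPreds => partitionPreds)
  .map(_.head.toDouble) // multi:softmax: one class value per row (assumed)
val labelAndPreds = parsedData.map(_.label).zip(flatPreds)
val trainErr = labelAndPreds.filter { case (label, pred) => label != pred }.count.toDouble / parsedData.count
println(s"Training error = $trainErr")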