Prediction probabilities for multiclass classification with Spark LogisticRegressionWithLBFGS

Date: 2016-03-22 10:14:47

Tags: apache-spark pyspark logistic-regression apache-spark-mllib

I am using LogisticRegressionWithLBFGS() to train a model with multiple classes.

From the MLlib documentation it appears that clearThreshold() can only be used when the classification is binary. Is there a similar way, for multiclass classification, to output the probability of each class for a given input to the model?

1 Answer:

Answer 0 (score: 0)

There are two ways to achieve this. One is to create a method that takes over the responsibility of predictPoint in LogisticRegression.scala:
import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.{DenseVector, Vector}

object ClassificationUtility {
  def predictPoint(dataMatrix: Vector, model: LogisticRegressionModel):
      (Double, Array[Double]) = {
    require(dataMatrix.size == model.numFeatures)
    val dataWithBiasSize: Int = model.weights.size / (model.numClasses - 1)
    val weightsArray: Array[Double] = model.weights match {
      case dv: DenseVector => dv.values
      case _ =>
        throw new IllegalArgumentException(
          s"weights only supports dense vector but got type ${model.weights.getClass}.")
    }
    var bestClass = 0
    var maxMargin = 0.0
    val withBias = dataMatrix.size + 1 == dataWithBiasSize
    val classProbabilities: Array[Double] = new Array[Double](model.numClasses)
    (0 until model.numClasses - 1).foreach { i =>
      var margin = 0.0
      dataMatrix.foreachActive { (index, value) =>
        if (value != 0.0) margin += value * weightsArray((i * dataWithBiasSize) + index)
      }
      // Intercept is required to be added into margin.
      if (withBias) {
        margin += weightsArray((i * dataWithBiasSize) + dataMatrix.size)
      }
      if (margin > maxMargin) {
        maxMargin = margin
        bestClass = i + 1
      }
      // Logistic score for class i + 1 (class 0 keeps the array default of 0.0).
      classProbabilities(i + 1) = 1.0 / (1.0 + math.exp(-margin))
    }
    (bestClass.toDouble, classProbabilities)
  }
}

Note that it differs only slightly from the original method: it simply computes the logistic as a function of the input features. It also defines some vals and vars that are originally private and declared outside this method. Finally, it collects the scores in an array and returns them along with the best class. I call my method like this:

// Compute raw scores on the test set.
val predictionAndLabelsAndProbabilities = test
  .map { case LabeledPoint(label, features) =>
    val (prediction, probabilities) = ClassificationUtility
      .predictPoint(features, model)
    (prediction, label, probabilities)
  }
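One caveat worth noting: the classProbabilities returned above are independent per-class sigmoid scores, not a normalized distribution, so they need not sum to 1. If a proper probability vector is required, one option (an addition of mine, not part of the original answer) is to rescale the scores:

```scala
// Hypothetical helper: rescale per-class sigmoid scores so they sum to 1.
// This is a pragmatic normalization, not the softmax the model would use
// if trained as a true multinomial classifier.
def normalizeScores(scores: Array[Double]): Array[Double] = {
  val total = scores.sum
  if (total == 0.0) scores else scores.map(_ / total)
}
```

For example, `normalizeScores(Array(0.2, 0.3, 0.5))` returns an array whose entries sum to 1.0.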

However, it seems the Spark contributors are discouraging the use of MLlib in favor of ML, and the ML logistic regression API does not currently support multiclass classification. I am now using OneVsRest as a one-vs-all classification wrapper. You can obtain the raw scores by iterating over the models:

val lr = new LogisticRegression().setFitIntercept(true)
val ovr = new OneVsRest()
ovr.setClassifier(lr)
val ovrModel = ovr.fit(training)
ovrModel.models.zipWithIndex.foreach {
  case (model: LogisticRegressionModel, i: Int) =>
    model.save(s"model-${model.uid}-$i")
}

val model0 = LogisticRegressionModel.load("model-logreg_457c82141c06-0")
val model1 = LogisticRegressionModel.load("model-logreg_457c82141c06-1")
val model2 = LogisticRegressionModel.load("model-logreg_457c82141c06-2")

Now that you have the individual models, you can obtain the probabilities by computing the sigmoid of the rawPrediction:

def sigmoid(x: Double): Double = {
  1.0 / (1.0 + Math.exp(-x))
}

val newPredictionAndLabels0 = model0.transform(newRescaledData)
  .select("prediction", "rawPrediction")
  .map(row => (row.getDouble(0),
    sigmoid(row.getAs[org.apache.spark.mllib.linalg.DenseVector](1).values(1)) ))
newPredictionAndLabels0.foreach(println)

val newPredictionAndLabels1 = model1.transform(newRescaledData)
  .select("prediction", "rawPrediction")
  .map(row => (row.getDouble(0),
    sigmoid(row.getAs[org.apache.spark.mllib.linalg.DenseVector](1).values(1)) ))
newPredictionAndLabels1.foreach(println)

val newPredictionAndLabels2 = model2.transform(newRescaledData)
  .select("prediction", "rawPrediction")
  .map(row => (row.getDouble(0),
    sigmoid(row.getAs[org.apache.spark.mllib.linalg.DenseVector](1).values(1)) ))
newPredictionAndLabels2.foreach(println)
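Since the three blocks above differ only in which model they use, they can be collapsed into a single loop. This is a sketch under the same assumptions as the answer (it reuses the sigmoid helper and the newRescaledData DataFrame defined there):

```scala
import org.apache.spark.mllib.linalg.DenseVector

// Iterate over the loaded one-vs-all models instead of repeating the
// transform/select/map pipeline once per model.
val models = Seq(model0, model1, model2)
models.zipWithIndex.foreach { case (m, i) =>
  val predictionsAndProbabilities = m.transform(newRescaledData)
    .select("prediction", "rawPrediction")
    .map(row => (row.getDouble(0),
      sigmoid(row.getAs[DenseVector](1).values(1))))
  println(s"--- model $i ---")
  predictionsAndProbabilities.foreach(println)
}
```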