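As far as I know, GBTs in Spark currently give you only the predicted labels.
I am thinking of computing the predicted probability for a class (say, for all the instances that fall under a certain leaf).
The code to build the GBT: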
import org.apache.spark.SparkContext
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel
import org.apache.spark.mllib.util.MLUtils
//Importing the data
val data = sc.textFile("data/mllib/credit_approval_2_attr.csv") //using the credit approval data set from UCI machine learning repository
//Parsing the data
val parsedData = data.map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts(0), Vectors.dense(parts.tail))
}
//Splitting the data
val splits = parsedData.randomSplit(Array(0.7, 0.3), seed = 11L)
val training = splits(0).cache()
val test = splits(1)
// Train a GradientBoostedTrees model.
// The defaultParams for Classification use LogLoss by default.
val boostingStrategy = BoostingStrategy.defaultParams("Classification")
boostingStrategy.numIterations = 2 // We can use more iterations in practice.
boostingStrategy.treeStrategy.numClasses = 2
boostingStrategy.treeStrategy.maxDepth = 2
boostingStrategy.treeStrategy.maxBins = 32
boostingStrategy.treeStrategy.subsamplingRate = 0.5
boostingStrategy.treeStrategy.maxMemoryInMB = 1024
boostingStrategy.learningRate = 0.1
// Empty categoricalFeaturesInfo indicates all features are continuous.
boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int, Int]()
val model = GradientBoostedTrees.train(training, boostingStrategy)
model.toDebugString
To keep things simple, this gives me 2 trees of depth 2:
Tree 0:
  If (feature 3 <= 2.0)
    If (feature 2 <= 1.25)
      Predict: -0.5752212389380531
    Else (feature 2 > 1.25)
      Predict: 0.07462686567164178
  Else (feature 3 > 2.0)
    If (feature 0 <= 30.17)
      Predict: 0.7272727272727273
    Else (feature 0 > 30.17)
      Predict: 1.0
Tree 1:
  If (feature 5 <= 67.0)
    If (feature 4 <= 100.0)
      Predict: 0.5739387416147804
    Else (feature 4 > 100.0)
      Predict: -0.550117566730937
  Else (feature 5 > 67.0)
    If (feature 2 <= 0.0)
      Predict: 3.0383669122382835
    Else (feature 2 > 0.0)
      Predict: 0.4332824083446489
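My question is: can I use the trees above to calculate predicted probabilities like this?
For each instance in the feature set used for prediction:
exp(leaf score from tree 0 + leaf score from tree 1) / (1 + exp(leaf score from tree 0 + leaf score from tree 1))
This gives me a kind of probability, but I am not sure it is the right way to do it. Also, if there is any documentation that explains how the leaf scores (the "Predict" values) are calculated, I would be very grateful if someone could share it.
Any suggestions would be much appreciated.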
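To make the formula in the question concrete, here is a minimal sketch that hand-encodes the two printed trees as plain Scala functions and applies the formula exactly as written there (an unweighted sum of leaf scores). The object name `PrintedTrees`, its method names, and the example instance are made up for illustration; only the feature indices, thresholds, and leaf values come from the debug output. Whether the unweighted sum is the right quantity is what the question is asking, so treat this as a restatement rather than a confirmed method.

```scala
// Hand-encoded copies of the two trees printed by model.toDebugString.
// Thresholds and leaf values are copied from that output; all names are hypothetical.
object PrintedTrees {
  // Tree 0: splits on feature 3, then on feature 2 or feature 0.
  def tree0(f: Array[Double]): Double =
    if (f(3) <= 2.0) {
      if (f(2) <= 1.25) -0.5752212389380531 else 0.07462686567164178
    } else {
      if (f(0) <= 30.17) 0.7272727272727273 else 1.0
    }

  // Tree 1: splits on feature 5, then on feature 4 or feature 2.
  def tree1(f: Array[Double]): Double =
    if (f(5) <= 67.0) {
      if (f(4) <= 100.0) 0.5739387416147804 else -0.550117566730937
    } else {
      if (f(2) <= 0.0) 3.0383669122382835 else 0.4332824083446489
    }

  // The formula from the question: exp(s) / (1 + exp(s)),
  // where s is the unweighted sum of the two leaf scores.
  def questionFormula(f: Array[Double]): Double = {
    val s = tree0(f) + tree1(f)
    math.exp(s) / (1.0 + math.exp(s))
  }

  // A made-up instance with six features (not a row from the data set).
  val exampleInstance: Array[Double] = Array(25.0, 0.0, 1.0, 3.0, 50.0, 10.0)
}
```

For `PrintedTrees.exampleInstance`, feature 3 is above 2.0 and feature 0 is below 30.17, so tree 0 yields 0.7272727272727273; feature 5 is below 67.0 and feature 4 is below 100.0, so tree 1 yields 0.5739387416147804. `PrintedTrees.questionFormula` then maps the sum of those two scores into the interval (0, 1).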
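Answer 0 (score: 2)
Here is my approach, which uses only Spark's own dependencies. You need to import the linear algebra library for the matrix operation used later, i.e. multiplying the tree predictions by the learning rate.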
import org.apache.spark.mllib.linalg.{Vectors, Matrices}
import org.apache.spark.mllib.linalg.distributed.{RowMatrix}
Suppose you have built a model with GBT:
val model = GradientBoostedTrees.train(trainingData, boostingStrategy)
Use the model object to compute the probabilities:
// Get the log odds predictions from each tree
val treePredictions = testData.map { point => model.trees.map(_.predict(point.features)) }
// Transform the arrays into matrices for multiplication
val treePredictionsVector = treePredictions.map(array => Vectors.dense(array))
val treePredictionsMatrix = new RowMatrix(treePredictionsVector)
val learningRate = model.treeWeights
val learningRateMatrix = Matrices.dense(learningRate.size, 1, learningRate)
val weightedTreePredictions = treePredictionsMatrix.multiply(learningRateMatrix)
// Calculate probability by ensembling the log odds
val classProb = weightedTreePredictions.rows.flatMap(_.toArray).map(x => 1 / (1 + Math.exp(-1 * x)))
classProb.collect
// You may tweak your decision boundary for different class labels
val classLabel = classProb.map(x => if (x > 0.5) 1.0 else 0.0)
classLabel.collect
Here is a snippet that you can copy and paste directly into spark-shell:
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.{Vectors, Matrices}
import org.apache.spark.mllib.linalg.distributed.{RowMatrix}
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel
// Load and parse the data file.
val csvData = sc.textFile("data/mllib/sample_tree_data.csv")
val data = csvData.map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts(0), Vectors.dense(parts.tail))
}
// Split the data into training and test sets (30% held out for testing)
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))
// Train a GBT model.
val boostingStrategy = BoostingStrategy.defaultParams("Classification")
boostingStrategy.numIterations = 50
boostingStrategy.treeStrategy.numClasses = 2
boostingStrategy.treeStrategy.maxDepth = 6
boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int, Int]()
val model = GradientBoostedTrees.train(trainingData, boostingStrategy)
// Get class label from raw predict function
val predictedLabels = model.predict(testData.map(_.features))
predictedLabels.collect
// Get class probability
val treePredictions = testData.map { point => model.trees.map(_.predict(point.features)) }
val treePredictionsVector = treePredictions.map(array => Vectors.dense(array))
val treePredictionsMatrix = new RowMatrix(treePredictionsVector)
val learningRate = model.treeWeights
val learningRateMatrix = Matrices.dense(learningRate.size, 1, learningRate)
val weightedTreePredictions = treePredictionsMatrix.multiply(learningRateMatrix)
val classProb = weightedTreePredictions.rows.flatMap(_.toArray).map(x => 1 / (1 + Math.exp(-1 * x)))
val classLabel = classProb.map(x => if (x > 0.5) 1.0 else 0.0)
classLabel.collect
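Answer 1 (score: 1)
// Assumption: `blas` is the netlib-java BLAS instance, imported the way Spark's own tree ensemble code does.
import com.github.fommil.netlib.BLAS.{getInstance => blas}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel

// Weighted sum of the per-tree predictions (the raw margin of the ensemble)
def score(features: Vector, gbdt: GradientBoostedTreesModel): Double = {
  val treePredictions = gbdt.trees.map(_.predict(features))
  blas.ddot(gbdt.numTrees, treePredictions, 1, gbdt.treeWeights, 1)
}

def sigmoid(v: Double): Double = {
  1 / (1 + Math.exp(-v))
}

// model is the output of GradientBoostedTrees.train(..., ...)
// testData is in libSVM format
val labelAndPreds = testData.map { point =>
  var prediction = score(point.features, model)
  prediction = sigmoid(prediction)
  (point.label, Vectors.dense(1.0 - prediction, prediction))
}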
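Answer 2 (score: 0)
Actually, I was able to predict the probabilities using the trees and the formula given in the question. I checked the result against the labels predicted by the GBT, and with a threshold of 0.5 they match exactly.
So we use the same approach, with one change.
For each instance in the feature set used for prediction:
exp(leaf score from tree 0 + (learning_rate) * leaf score from tree 1) / (1 + exp(leaf score from tree 0 + (learning_rate) * leaf score from tree 1))
This essentially gives me the predicted probability.
I tested this on 3 trees of depth 3 with the same result, and also on a different data set.
It would be good to know whether others have already tried this. If not, they can try it and comment.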
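The weighted formula in this answer can be sketched in plain Scala without any Spark dependency. The object name `WeightedMargin` and its methods are hypothetical, and the weights `Array(1.0, 0.1)` are an assumption: they correspond to a first tree with weight 1.0 and a second tree scaled by the learning rate of 0.1 set in the question, which is what this answer's formula implies. On a real model the values of `model.treeWeights` should be used instead.

```scala
// A minimal sketch of answer 2's formula; names are hypothetical.
object WeightedMargin {
  // Weighted sum of per-tree leaf scores (the raw margin of the ensemble).
  def margin(leafScores: Array[Double], treeWeights: Array[Double]): Double = {
    require(leafScores.length == treeWeights.length, "one weight per tree")
    leafScores.zip(treeWeights).map { case (s, w) => s * w }.sum
  }

  // exp(x) / (1 + exp(x)), written in the equivalent form 1 / (1 + exp(-x)).
  def logistic(x: Double): Double = 1.0 / (1.0 + math.exp(-x))

  // Leaf scores taken from the two trees printed in the question.
  val exampleScores: Array[Double] = Array(0.7272727272727273, 0.5739387416147804)
  // Assumed weights: 1.0 for the first tree, the learning rate (0.1) for the second.
  val exampleWeights: Array[Double] = Array(1.0, 0.1)
}
```

With these inputs, `WeightedMargin.margin(exampleScores, exampleWeights)` is 0.7272727272727273 + 0.1 * 0.5739387416147804, and `WeightedMargin.logistic` of that margin lands between 0.68 and 0.69.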
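Answer 3 (score: 0)
Actually, the answer above is wrong: the sigmoid function used there is not correct in this case, because Spark converts the labels to {-1, 1}. You should use code like this:
// Assumption: `blas` is the netlib-java BLAS instance, imported the way Spark's own tree ensemble code does.
import com.github.fommil.netlib.BLAS.{getInstance => blas}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel

def score(features: Vector, gbdt: GradientBoostedTreesModel): Double = {
  val treePredictions = gbdt.trees.map(_.predict(features))
  blas.ddot(gbdt.numTrees, treePredictions, 1, gbdt.treeWeights, 1)
}

val labelAndPreds = testData.map { point =>
  var prediction = score(point.features, model)
  // Note the factor of 2 in the exponent
  prediction = 1.0 / (1.0 + math.exp(-2.0 * prediction))
  (point.label, Vectors.dense(1.0 - prediction, prediction))
}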
More details can be found on page 9 of "Greedy Function Approximation: A Gradient Boosting Machine". There is also a related pull request on Spark: https://github.com/apache/spark/pull/16441
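The two mappings discussed in this thread, 1 / (1 + exp(-F)) and 1 / (1 + exp(-2F)), can be compared side by side. The sketch below uses hypothetical names and arbitrary margin values; it only illustrates that both mappings cross 0.5 at F = 0, so the labels obtained with a 0.5 threshold are the same, while the probabilities themselves differ for any nonzero margin. This would be consistent with the observation elsewhere in the thread that thresholded labels match the GBT output.

```scala
// A minimal comparison of the two link functions; names are hypothetical.
object LinkComparison {
  // Mapping without the factor of 2 in the exponent.
  def plainLogistic(margin: Double): Double = 1.0 / (1.0 + math.exp(-margin))

  // Mapping with the factor of 2 in the exponent.
  def doubledLogistic(margin: Double): Double = 1.0 / (1.0 + math.exp(-2.0 * margin))

  // Class label at a 0.5 decision threshold.
  def label(p: Double): Double = if (p > 0.5) 1.0 else 0.0

  // Arbitrary margins chosen for illustration only.
  val exampleMargins: Seq[Double] = Seq(-2.0, -0.3, 0.1, 1.5)

  // True when both mappings give the same label for every example margin.
  def labelsAgree: Boolean =
    exampleMargins.forall(m => label(plainLogistic(m)) == label(doubledLogistic(m)))
}
```

So a check that only compares thresholded labels cannot tell the two mappings apart; comparing the probabilities themselves (for example with a calibration plot) would be needed for that.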
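Answer 4 (score: 0)
Actually, what @hbghhy says is wrong and @Run2 is right: Spark uses twice the binomial negative log-likelihood as its loss, whereas Friedman, on page 9 of "Greedy Function Approximation", uses the binomial negative log-likelihood itself as the loss.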
/**
 * :: DeveloperApi ::
 * Class for log loss calculation (for classification).
 * This uses twice the binomial negative log likelihood, called "deviance" in Friedman (1999).
 *
 * The log loss is defined as:
 *   2 log(1 + exp(-2 y F(x)))
 * where y is a label in {-1, 1} and F(x) is the model prediction for features x.
 */
@Since("1.2.0")
@DeveloperApi
object LogLoss extends ClassificationLoss {

  /**
   * Method to calculate the loss gradients for the gradient boosting calculation for binary
   * classification
   * The gradient with respect to F(x) is: - 4 y / (1 + exp(2 y F(x)))
   * @param prediction Predicted label.
   * @param label True label.
   * @return Loss gradient
   */
  @Since("1.2.0")
  override def gradient(prediction: Double, label: Double): Double = {
    - 4.0 * label / (1.0 + math.exp(2.0 * label * prediction))
  }

  override private[spark] def computeError(prediction: Double, label: Double): Double = {
    val margin = 2.0 * label * prediction
    // The following is equivalent to 2.0 * log(1 + exp(-margin)) but more numerically stable.
    2.0 * MLUtils.log1pExp(-margin)
  }
}
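For reference, the relationship between F(x) and the class probability that follows from the loss in the docstring above can be worked out directly. This is a sketch of the population minimizer of that loss only, with p introduced here as shorthand for P(y = 1 | x); it assumes the trained ensemble approximates that minimizer and says nothing about what any particular Spark method returns.

```latex
% Expected loss at a fixed x, with p = P(y = 1 \mid x) and y \in \{-1, 1\}
\mathbb{E}\left[ L(y, F) \right]
  = 2p \log\left(1 + e^{-2F}\right) + 2(1 - p) \log\left(1 + e^{2F}\right)

% Derivative with respect to F (each term matches the gradient -4y / (1 + e^{2yF}))
\frac{\partial}{\partial F} \mathbb{E}\left[ L(y, F) \right]
  = -\frac{4p}{1 + e^{2F}} + \frac{4(1 - p)}{1 + e^{-2F}}

% Setting the derivative to zero and using 1 / (1 + e^{-2F}) = e^{2F} / (1 + e^{2F})
p = (1 - p)\, e^{2F^{*}}
\quad\Longrightarrow\quad
F^{*} = \frac{1}{2} \log \frac{p}{1 - p}
\quad\Longrightarrow\quad
p = \frac{1}{1 + e^{-2F^{*}}}
```

The leading constant 2 in the loss multiplies both terms of the derivative equally, so it cancels when the derivative is set to zero; the 2 that appears in the final expression comes from the 2 inside the exponent.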