How do you handle the probabilities from XGBoost-scored data? (Scala)

Time: 2018-05-23 18:24:13

Tags: scala apache-spark vector xgboost

I train XGBoost as shown below and then get back a set of keys and probabilities. The probabilities are returned as a vector:

%scala 
import ml.dmlc.xgboost4j.scala.spark.{DataUtils, XGBoost}

val dataset = sqlContext.table("train_set")

val paramMap = List(
      "eta" -> 0.023f,
      "max_depth" -> 10,
      "base_score" -> 0.005,
      "eval_metric" -> "auc",
      "seed" -> 49,
      "objective" -> "binary:logistic").toMap

val xgboostModel = XGBoost.trainWithDataFrame(dataset, paramMap, 30, 10, useExternalMemory=true) 

val test_dataset = sqlContext.table("test_set")
val predictions = xgboostModel.setExternalMemory(true).transform(test_dataset).select("some_key", "probabilities")

org.apache.spark.sql.DataFrame = [some_key:int,probabilities:vector]

/*
+--------+-------------+
|some_key|probabilities|
+--------+-------------+
|       0| [0.98,0.02] |
|       1| [0.95,0.05] |
|       2| [0.99,0.01] |
|       3| [0.82,0.18] |
+--------+-------------+
*/

I only want the second probability, not the whole vector. How can I create a new DataFrame from it together with some_key?

/*
+--------+-----------+
|some_key|probability|
+--------+-----------+
|       0|      0.02 |
|       1|      0.05 |
|       2|      0.01 |
|       3|      0.18 |
+--------+-----------+
*/
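
One possible approach, sketched here under assumptions rather than taken from the original post, is a small UDF that returns the element at index 1 of the vector. The sketch assumes the probabilities column holds org.apache.spark.ml.linalg.Vector values (if your xgboost4j-spark build emits the older org.apache.spark.mllib.linalg.Vector, swap the import). The local SparkSession, the hand-built stand-in for predictions, and the names secondElement and result are illustrative only.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Assumption: a local session for illustration; in a notebook `spark` usually already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("second-probability")
  .getOrCreate()
import spark.implicits._

// Stand-in for the `predictions` DataFrame from the question (same values as the sample table).
val predictions = Seq(
  (0, Vectors.dense(0.98, 0.02)),
  (1, Vectors.dense(0.95, 0.05)),
  (2, Vectors.dense(0.99, 0.01)),
  (3, Vectors.dense(0.82, 0.18))
).toDF("some_key", "probabilities")

// Hypothetical helper: returns the vector element at index 1 (the second probability).
val secondElement = udf((v: Vector) => v(1))

// Keep the key and replace the vector column with a plain double column.
val result = predictions.select(
  col("some_key"),
  secondElement(col("probabilities")).as("probability")
)

result.show()
```

With the real model output, only the select step should be needed, applied to the predictions DataFrame from the question. On Spark 3.0 or later, org.apache.spark.ml.functions.vector_to_array followed by getItem(1) may be an alternative that avoids a custom UDF.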

0 answers:

No answers