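I train xgboost like this and then get back a set of predictions with their probabilities. The probabilities are returned as a vector: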
%scala
import ml.dmlc.xgboost4j.scala.spark.{DataUtils, XGBoost}
val dataset = sqlContext.table("train_set")
val paramMap = List(
"eta" -> 0.023f,
"max_depth" -> 10,
"base_score" -> 0.005,
"eval_metric" -> "auc",
"seed" -> 49,
"objective" -> "binary:logistic").toMap
val xgboostModel = XGBoost.trainWithDataFrame(dataset, paramMap, 30, 10, useExternalMemory=true)
val test_dataset = sqlContext.table("test_set")
val predictions = xgboostModel.setExternalMemory(true).transform(test_dataset).select("some_key", "probabilities")
// predictions: org.apache.spark.sql.DataFrame = [some_key: int, probabilities: vector]
/*
+--------+-------------+
|some_key|probabilities|
+--------+-------------+
|       0|  [0.98,0.02]|
|       1|  [0.95,0.05]|
|       2|  [0.99,0.01]|
|       3|  [0.82,0.18]|
+--------+-------------+
*/
I only want the second probability, not the whole vector. How can I create a new DataFrame containing just that value together with some_key?
/*
+--------+-----------+
|some_key|probability|
+--------+-----------+
|       0|       0.02|
|       1|       0.05|
|       2|       0.01|
|       3|       0.18|
+--------+-----------+
*/
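One possible approach, sketched under assumptions rather than confirmed against this exact setup: wrap the vector lookup in a UDF and select it next to some_key. The helper name `secondProbability` is hypothetical, and the sketch assumes the probabilities column holds `org.apache.spark.ml.linalg.Vector` values (Spark 2.x or later). If the column uses the older mllib vector type, the import would need to be `org.apache.spark.mllib.linalg.Vector` instead.

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical helper: keeps some_key and replaces the vector column
// with its second element (index 1), named "probability".
// Assumes "probabilities" is an ml.linalg.Vector with at least two entries.
def secondProbability(df: DataFrame): DataFrame = {
  val second = udf((v: Vector) => v(1))
  df.select(col("some_key"), second(col("probabilities")).as("probability"))
}
```

With the DataFrame from the question this would be called as `val result = secondProbability(predictions)`. On Spark 3.0 or later, `org.apache.spark.ml.functions.vector_to_array` followed by `getItem(1)` may be an alternative that avoids a UDF.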