How do I get the first value of each list?

Date: 2017-01-03 09:34:13

Tags: scala apache-spark dataframe

My DataFrame looks like this:

+------------------------+----------------------------------------+
|ID                      |probability                             |
+------------------------+----------------------------------------+
|583190715ccb64f503a|[0.49128147201958017,0.5087185279804199]|
|58326da75fc764ad200|[0.42143416087939345,0.5785658391206066]|
|583270ff17c76455610|[0.3949217100212508,0.6050782899787492] |
|583287c97ec7641b2d4|[0.4965059792664432,0.5034940207335569] |
|5832d7e279c764f52e4|[0.49128147201958017,0.5087185279804199]|
|5832e5023ec76406760|[0.4775830044196701,0.52241699558033]   |
|5832f88859cb64960ea|[0.4360509428173421,0.563949057182658]  |
|58332e6238c7643e6a7|[0.48730029128352853,0.5126997087164714]|

I use the following to get the probability column:
val proVal = Data.select("probability").rdd.map(r => r(0)).collect()
proVal.foreach(println)

The result is:

[0.49128147201958017,0.5087185279804199]
[0.42143416087939345,0.5785658391206066]
[0.3949217100212508,0.6050782899787492]
[0.4965059792664432,0.5034940207335569]
[0.49128147201958017,0.5087185279804199]
[0.4775830044196701,0.52241699558033]
[0.4360509428173421,0.563949057182658]
[0.48730029128352853,0.5126997087164714]

But I want only the first value of each row, like this:

0.49128147201958017
0.42143416087939345
0.3949217100212508
0.4965059792664432
0.49128147201958017
0.4775830044196701
0.4360509428173421
0.48730029128352853

How can I do that?

The input is standard random-forest input; just above this, the data was obtained with val Data = predictions.select("docID", "probability").

predictions.printSchema()
  

root
 |-- docID: string (nullable = true)
 |-- label: double (nullable = false)
 |-- features: vector (nullable = true)
 |-- indexedLabel: double (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = true)
 |-- predictLabel: string (nullable = true)
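
For reference, this schema is the typical output of a Spark ML random forest pipeline. A minimal sketch of how such a predictions DataFrame is usually produced (the actual pipeline is not shown in the question, so the stages and the trainingData/testData names are assumptions):

import org.apache.spark.ml.classification.RandomForestClassifier

// Hypothetical training step; the column names follow the schema printed above.
val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("features")
  .setProbabilityCol("probability")

val model = rf.fit(trainingData)            // trainingData: assumed labeled DataFrame
val predictions = model.transform(testData) // adds rawPrediction, probability, prediction
val Data = predictions.select("docID", "probability")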

I want to get the first value of the "probability" column.

1 Answer:

Answer 0 (score: 2)

You can use the Column.apply method to get the n-th item of an array-typed column - in this case the first item (using index 0):

import sqlContext.implicits._
val proVal = Data.select($"probability"(0)).rdd.map(r => r(0)).collect()
By the way, if you are using Spark 1.6 or later, you can also use the Dataset API for a more concise way of converting the DataFrame to Doubles:

val proVal = Data.select($"probability"(0)).as[Double].collect()
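
One caveat, hedged: the printSchema output above shows probability as a vector column (Spark ML's VectorUDT) rather than a SQL array, and indexing a vector column with $"probability"(0) may not resolve in every Spark version. In that case, a minimal sketch of a workaround, assuming Spark 2.x where the element type is org.apache.spark.ml.linalg.Vector (on 1.x it lives in org.apache.spark.mllib.linalg), is a small UDF that unwraps the vector:

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.udf

// firstElem is a hypothetical helper: it takes the first component of each probability vector.
val firstElem = udf((v: Vector) => v(0))
val proVal = Data.select(firstElem($"probability")).as[Double].collect()

As before, the $"..." syntax and .as[Double] rely on the implicits import shown above.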