How to use withColumn to extract an element of a vector

Time: 2017-03-24 10:00:02

Tags: scala apache-spark apache-spark-sql

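I have a DataFrame holding the predictions from a pipeline model transform that represents a logistic regression model. It produces a "probability" column of type vector, presumably holding the model's probabilities for the predictable values (0 and 1). How can I get at the individual values? My naive approach: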

predictionDF.select("probability").show()
predictionDF.select("probability").printSchema()
prediction.withColumn("certainty_no_brudd",
                      col("probability").cast("vector")(0))

gives me the following output:

+--------------------+
|         probability|
+--------------------+
|[0.79704719956042...|
|[0.96065621060123...|
|[0.94869126147921...|
|[0.98881973295162...|
|[0.94738842407184...|
|[0.99517040850391...|
|[0.67513098659304...|
|[0.98185993174719...|
|[0.88716858689769...|
|[0.94886839225328...|
|[0.87093946910993...|
|[0.93752063096904...|
|[0.99093365566705...|
|[0.97163117781123...|
|[0.88384736556118...|
|[0.89095359364458...|
|[0.94304454190511...|
|[0.96116865958545...|
|[0.91555675983743...|
|[0.96092603080292...|
+--------------------+
only showing top 20 rows

root
 |-- probability: vector (nullable = true)

Exception in thread "main" org.apache.spark.sql.catalyst.parser.ParseException: 
DataType vector() is not supported.(line 1, pos 0)

== SQL ==
vector
^^^


at org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitPrimitiveDataType$1.apply(AstBuilder.scala:1440)
...

1 answer:

Answer 0 (score: 2)

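Use a UDF. The "vector" type shown by printSchema is a user-defined type, not a SQL data type, which is why cast("vector") fails to parse: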

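import org.apache.spark.ml.linalg._
import org.apache.spark.sql.functions._

// UDF that returns the i-th element of an ML vector as a Double
val getItem = udf((v: Vector, i: Int) => v(i))

// $"..." requires import spark.implicits._ (already in scope in spark-shell)
prediction.withColumn("certainty_no_brudd", getItem($"probability", lit(0)))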
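For anyone who wants to try this outside an existing pipeline, here is a minimal self-contained sketch of the same UDF approach. The object name `VectorElementExample`, the helper `withFirstElement`, the local `SparkSession` settings, and the two sample vectors are all assumptions made for illustration. They stand in for the real `prediction` DataFrame from the question and are not part of the original code.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, udf}

object VectorElementExample {
  // Same UDF as in the answer: returns the i-th element of an ML vector
  val getItem = udf((v: Vector, i: Int) => v(i))

  // Hypothetical helper: adds element 0 of "probability" as a Double column
  def withFirstElement(df: DataFrame): DataFrame =
    df.withColumn("certainty_no_brudd", getItem(col("probability"), lit(0)))

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumed settings)
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("vector-element-example")
      .getOrCreate()

    // Made-up stand-in for the pipeline output in the question
    val data = Seq(Vectors.dense(0.8, 0.2), Vectors.dense(0.3, 0.7))
    val prediction = spark.createDataFrame(data.map(Tuple1.apply)).toDF("probability")

    val result = withFirstElement(prediction)
    result.printSchema() // certainty_no_brudd should appear as a double column
    result.show(false)

    spark.stop()
  }
}
```

If you are on Spark 3.0 or later, `org.apache.spark.ml.functions.vector_to_array` may be worth a look as an alternative that avoids a UDF: it converts the vector column to an array column, which can then be indexed with `(0)`.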