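I have a DataFrame of predictions produced by the transform of a PipelineModel that represents a logistic regression model. It contains a "probability" column of vector type, presumably representing the regression line with respect to the predictable values (0 and 1). How do I get the individual values out of it? My naive attempt: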
predictionDF.select("probability").show()
predictionDF.select("probability").printSchema()
prediction.withColumn("certainty_no_brudd",
col("probability").cast("vector")(0))
This gives me the following output:
+--------------------+
| probability|
+--------------------+
|[0.79704719956042...|
|[0.96065621060123...|
|[0.94869126147921...|
|[0.98881973295162...|
|[0.94738842407184...|
|[0.99517040850391...|
|[0.67513098659304...|
|[0.98185993174719...|
|[0.88716858689769...|
|[0.94886839225328...|
|[0.87093946910993...|
|[0.93752063096904...|
|[0.99093365566705...|
|[0.97163117781123...|
|[0.88384736556118...|
|[0.89095359364458...|
|[0.94304454190511...|
|[0.96116865958545...|
|[0.91555675983743...|
|[0.96092603080292...|
+--------------------+
only showing top 20 rows
root
|-- probability: vector (nullable = true)
Exception in thread "main" org.apache.spark.sql.catalyst.parser.ParseException:
DataType vector() is not supported.(line 1, pos 0)
== SQL ==
vector
^^^
at org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitPrimitiveDataType$1.apply(AstBuilder.scala:1440)
...
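Answer 0 (score: 2)
The cast fails because vector is a user-defined type from spark.ml, not a SQL data type, so the type parser cannot resolve it. Use a UDF instead: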
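import org.apache.spark.ml.linalg._
import org.apache.spark.sql.functions._
// The $"..." column syntax also needs `import spark.implicits._`, where spark is your SparkSession.
// Return element i of an ML vector as a Double.
val getItem = udf((v: Vector, i: Int) => v(i))
prediction.withColumn("certainty_no_brudd", getItem($"probability", lit(0)))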
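For anyone who wants to try this end to end, here is a minimal sketch. It assumes a local SparkSession and a tiny hand-built DataFrame standing in for the pipeline output; the `ProbabilityExtraction` object, its method names, and the sample values are made up for illustration. It also shows `vector_to_array` from `org.apache.spark.ml.functions`, a built-in alternative to the UDF that is only available in Spark 3.0 and later.

```scala
import org.apache.spark.ml.functions.vector_to_array // Spark 3.0+ only
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, udf}

// Hypothetical helper object, not part of the original question or answer.
object ProbabilityExtraction {

  // Stand-in for the pipeline output: one vector column named "probability".
  // The values are invented sample data.
  def sample(spark: SparkSession): DataFrame =
    spark.createDataFrame(Seq(
      (0, Vectors.dense(0.8, 0.2)),
      (1, Vectors.dense(0.25, 0.75))
    )).toDF("id", "probability")

  // UDF approach from the answer: pick element i out of the vector.
  private val getItem = udf((v: Vector, i: Int) => v(i))

  def viaUdf(df: DataFrame): DataFrame =
    df.withColumn("certainty_no_brudd", getItem(col("probability"), lit(0)))

  // Alternative (assumes Spark 3.0+): convert the vector to an array column,
  // then index it with the ordinary Column.getItem.
  def viaArray(df: DataFrame): DataFrame =
    df.withColumn("certainty_no_brudd", vector_to_array(col("probability")).getItem(0))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("probability-extraction")
      .getOrCreate()

    val prediction = sample(spark)
    viaUdf(prediction).show()
    viaArray(prediction).show()

    spark.stop()
  }
}
```

Both variants add a plain double column. On Spark 2.x only the UDF version will compile, since `vector_to_array` does not exist there.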