I want to use a UDF to convert one of my DataFrame's columns from vector type to a string.
When I printSchema my DataFrame, the column does show the vector
data type, but when I apply the UDF to convert the vector to a string, I get this error:
org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(probability)'
due to data type mismatch: argument 1 requires vector type, however, '`probability`' is of
struct<type:tinyint,size:int,indices:array<int>,values:array<double>> type.;;
Imports:
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import com.microsoft.ml.spark.{LightGBMClassifier,LightGBMClassificationModel}
import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage}
UDF
val vecToString = udf( (xs: Vector) => xs.toArray.mkString(";"))
DataFrame (printSchema):
val inputData = spark.read.parquet(inputDataPath)
val pipelineModel = PipelineModel.load(modelPath)
val predictions = pipelineModel.transform(inputData)
// Selecting only 2 columns from the predictions DF:
|-- probability: vector (nullable = true)
|-- prediction: double (nullable = false)
+-----------------------------------------+----------+
|probability                              |prediction|
+-----------------------------------------+----------+
|[0.2554504562575961,0.7445495437424039]  |1.0       |
|[0.7763149003135102,0.22368509968648975] |0.0       |
+-----------------------------------------+----------+
Using my UDF to convert the probability column to a string:
val tmp = predictions
.withColumn("probabilityStr" , vecToString($"probability"))
This is where the error above occurs.
Also tried:
val vecToString = udf( (xs: Array[Double]) => xs.mkString(";"))
AnalysisException: cannot resolve 'UDF(probability)' due to data type mismatch: argument 1 requires array<double> type, however, '`probability`' is of struct<type:tinyint,size:int,indices:array<int>,values:array<double>> type.;;
This works fine when I use other models (not LightGBM). Could it be due to the type of model used?
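One thing worth checking (an assumption on my part, based on the `struct<type:tinyint,size:int,indices:array<int>,values:array<double>>` shape in the error, which matches the `VectorUDT` of the newer `org.apache.spark.ml.linalg` package): the imports above bring in the old `org.apache.spark.mllib.linalg.Vector`, while a Spark ML `PipelineModel` produces `org.apache.spark.ml.linalg.Vector` columns, so the UDF's argument type would not match the column's UDT. A minimal sketch of the UDF re-typed against the `ml` package:

```scala
// Sketch, assuming `predictions` is the DataFrame from pipelineModel.transform.
// Key change: import Vector from ml.linalg, NOT mllib.linalg.
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.udf

val vecToString = udf((xs: Vector) => xs.toArray.mkString(";"))

val tmp = predictions.withColumn("probabilityStr", vecToString($"probability"))
```

If this is the cause, the error should disappear regardless of which classifier produced the column, since both LightGBM's and Spark's built-in classifiers emit `ml.linalg` vectors through the `ml.Pipeline` API.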