I want to use a UDF to convert one of my DataFrame's columns from vector type to a string.
When I printSchema my DataFrame, the column does show the vector
data type, but when I apply the UDF to convert the vector to a string, I get this error:
org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(probability)'
due to data type mismatch: argument 1 requires vector type, however, '`probability`' is of
struct<type:tinyint,size:int,indices:array<int>,values:array<double>> type.;;
Imports:
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import com.microsoft.ml.spark.{LightGBMClassifier,LightGBMClassificationModel}
import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage}
UDF
val vecToString = udf( (xs: Vector) => xs.toArray.mkString(";"))
DataFrame (printSchema):
val inputData = spark.read.parquet(inputDataPath)
val pipelineModel = PipelineModel.load(modelPath)
val predictions = pipelineModel.transform(inputData)
// Selecting only 2 columns from the predictions DF:
|-- probability: vector (nullable = true)
|-- prediction: double (nullable = false)
+-----------------------------------------+----------+
|probability                              |prediction|
+-----------------------------------------+----------+
|[0.2554504562575961,0.7445495437424039]  |1.0       |
|[0.7763149003135102,0.22368509968648975] |0.0       |
+-----------------------------------------+----------+
Using my UDF to convert the probability column to a string:
val tmp = predictions
.withColumn("probabilityStr" , vecToString($"probability"))
This is where the error above occurs.
Also tried:
val vecToString = udf( (xs: Array[Double]) => xs.mkString(";"))
AnalysisException: cannot resolve 'UDF(probability)' due to data type mismatch: argument 1 requires array<double> type, however, '`probability`' is of struct<type:tinyint,size:int,indices:array<int>,values:array<double>> type.;;
This works fine when I use other models (not LightGBM). Could it be due to the type of model used?
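One thing worth checking (an assumption on my part, based on the `struct<type:tinyint,size:int,indices:array<int>,values:array<double>>` shape in the error, which matches the `VectorUDT` of the newer `org.apache.spark.ml.linalg` package): the imports above bring in the old `org.apache.spark.mllib.linalg.Vector`, while a Spark ML `PipelineModel` produces `org.apache.spark.ml.linalg.Vector` columns, so the UDF's argument type would not match the column's UDT. A minimal sketch of the UDF re-typed against the `ml` package:

```scala
// Sketch, assuming `predictions` is the DataFrame from pipelineModel.transform.
// Key change: import Vector from ml.linalg, NOT mllib.linalg.
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.udf

val vecToString = udf((xs: Vector) => xs.toArray.mkString(";"))

val tmp = predictions.withColumn("probabilityStr", vecToString($"probability"))
```

If this is the cause, the error should disappear regardless of which classifier produced the column, since both LightGBM's and Spark's built-in classifiers emit `ml.linalg` vectors through the `ml.Pipeline` API.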