I use Spark MLlib to generate predictions for my data and then store them in HDFS in Avro format:
val dataPredictions = myModel.transform(myData)
val output = dataPredictions.select("id", "probability", "prediction")
output.write.format("com.databricks.spark.avro").save(path)
I get the following exception:
com.databricks.spark.avro.SchemaConverters$IncompatibleSchemaException:
Unexpected type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7.
My understanding is that the 'probability' column, which holds an ML Vector, cannot be serialized to Avro.
Answer 0 (score: 0)
To convert any Vector to an Array[Double], you can use the following UDF:
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions.col
import org.apache.spark.ml.linalg.Vector
val vectorToArrayUdf = udf((vector: Vector) => vector.toArray)
// Replace the Vector column with a plain array column before writing
val output = dataPredictions
  .withColumn("probabilities", vectorToArrayUdf(col("probability")))
  .select("id", "probabilities", "prediction")
output.write.format("com.databricks.spark.avro").save(path)
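On Spark 3.0 or later, the same conversion can probably be done without a hand-written UDF by using the built-in vector_to_array function from org.apache.spark.ml.functions. The sketch below is only an illustration under stated assumptions: the small dataPredictions DataFrame is a hypothetical stand-in for the output of myModel.transform(myData), the path /tmp/predictions_avro is made up, and the short "avro" format name assumes Spark 2.4+ with the spark-avro module on the classpath.

```scala
import org.apache.spark.ml.functions.vector_to_array // available from Spark 3.0
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("vector-to-avro-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-in for myModel.transform(myData)
val dataPredictions = Seq(
  (1L, Vectors.dense(0.2, 0.8), 1.0),
  (2L, Vectors.dense(0.9, 0.1), 0.0)
).toDF("id", "probability", "prediction")

// Convert the Vector column to array<double>, which Avro can represent
val output = dataPredictions
  .withColumn("probabilities", vector_to_array(col("probability")))
  .select("id", "probabilities", "prediction")

// Assumed output location; requires the spark-avro module
val path = "/tmp/predictions_avro"
output.write.mode("overwrite").format("avro").save(path)
```

On older Spark versions the UDF above remains the straightforward option, since vector_to_array is not available there.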