How do I fix this type error after using a UDF in Spark SQL?

Asked: 2018-08-21 09:45:13

Tags: scala apache-spark apache-spark-sql user-defined-functions

I want to explode the features column (type: an ml.linalg sparse Vector) into the index and value of each feature, so I do the following:

import org.apache.spark.ml.linalg
import org.apache.spark.sql.functions.{col, explode, udf}

// Pair each non-zero feature's index with its value.
def zipKeyValue(vec: linalg.Vector): Array[(Int, Double)] = {
  val indices: Array[Int] = vec.toSparse.indices
  val values: Array[Double] = vec.toSparse.values
  indices.zip(values)
}

val udf1 = udf(zipKeyValue _)
val df1 = df.withColumn("features", udf1(col("features")))
val df2 = df1.withColumn("features", explode(col("features")))
val udf2 = udf((f: Tuple2[Int, Double]) => f._1.toString)
val udf3 = udf((f: Tuple2[Int, Double]) => f._2)
val df3 = df2.withColumn("key", udf2(col("features"))).withColumn("value", udf3(col("features")))
df3.show()

But I get the error: Failed to execute user defined function(anonfun$38: (struct<_1:int,_2:double>) => string) Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to scala.Tuple2

This is confusing to me: my function zipKeyValue returns Array[(Int, Double)], so after the explode each element should be a Tuple2[Int, Double], but what I actually get is struct<_1:int,_2:double>. How can I fix this?
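(Background on the mismatch: Spark SQL has no native tuple type, so Catalyst encodes each (Int, Double) as a struct<_1:int,_2:double>, and struct values are handed to Scala UDFs as Row objects, never as Tuple2. A minimal check against the df2 defined above confirms the encoded type:)

// After the explode, each `features` value is a struct, not a Tuple2.
df2.printSchema()
// expected output (nullability flags may vary):
// root
//  |-- features: struct
//  |    |-- _1: integer
//  |    |-- _2: double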

1 Answer:

Answer 0 (score: 0)

You don't need a UDF here. Just select the struct's fields directly:

df2
  .withColumn("key", col("features._1"))
  .withColumn("value", col("features._2"))

In general, if you do use a UDF over a struct column, you should take a Row rather than a Tuple, since Spark passes struct values to Scala UDFs as Row objects:

import org.apache.spark.sql.Row

// Read the struct's fields by position from the Row.
val udf2 = udf((f: Row) => f.getInt(0).toString)
val udf3 = udf((f: Row) => f.getDouble(1))
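Wired back into the question's pipeline (same df2 as above), these Row-based UDFs replace the failing Tuple2 versions:

val df3 = df2
  .withColumn("key", udf2(col("features")))
  .withColumn("value", udf3(col("features")))
df3.show()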