I want to explode a features column (type: ml.linalg sparse vector) into the index and value of each feature, so I do the following:
def zipKeyValue(vec: linalg.Vector): Array[(Int, Double)] = {
  val indices: Array[Int] = vec.toSparse.indices
  val values: Array[Double] = vec.toSparse.values
  indices.zip(values)
}
val udf1 = udf(zipKeyValue _)
val df1 = df.withColumn("features", udf1(col("features")))
val df2 = df1.withColumn("features", explode(col("features")))
val udf2 = udf((f: Tuple2[Int, Double]) => f._1.toString)
val udf3 = udf((f: Tuple2[Int, Double]) => f._2)
val df3 = df2.withColumn("key", udf2(col("features"))).withColumn("value", udf3(col("features")))
df3.show()
But I get the error:
Failed to execute user defined function(anonfun$38: (struct<_1:int,_2:double>) => string)
Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to scala.Tuple2
This is confusing to me, because my function zipKeyValue returns Tuple2[(Int,Double)], yet what I actually get is struct<_1:int,_2:double>. How can I fix this?
Answer 0 (score: 0)
You don't need UDFs here. Just select the struct's fields as columns:
df2
.withColumn("key", col("features._1"))
.withColumn("value", col("features._2"))
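For example, a minimal end-to-end sketch, assuming a local SparkSession and a hypothetical one-row toy DataFrame (the column names match the question; the vector contents are made up for illustration):

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, udf}

val spark = SparkSession.builder().master("local[1]").appName("explode-features").getOrCreate()
import spark.implicits._

// Toy sparse-vector column mirroring the question's input (assumed data).
val df = Seq(Tuple1(Vectors.sparse(4, Array(0, 2), Array(1.0, 3.0)))).toDF("features")

// Zip indices with values; Spark encodes Array[(Int, Double)] as array<struct<_1:int,_2:double>>.
val zipKeyValue = udf((vec: Vector) => vec.toSparse.indices.zip(vec.toSparse.values))
val df2 = df.withColumn("features", explode(zipKeyValue(col("features"))))

// No UDF needed: select the struct fields directly.
val df3 = df2
  .withColumn("key", col("features._1"))
  .withColumn("value", col("features._2"))
df3.show()
```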
If you do keep the UDFs, the key is that Spark passes a struct to a Scala UDF as a Row, not a Tuple2, so in general you should use Rows instead of Tuples:
import org.apache.spark.sql.Row
val udf2 = udf((f: Row) => f.getInt(0).toString)
val udf3 = udf((f: Row) => f.getDouble(1))
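A self-contained sketch of the Row-based variant, again assuming a local SparkSession and made-up toy data so the snippet stands alone:

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, explode, udf}

val spark = SparkSession.builder().master("local[1]").appName("row-udf").getOrCreate()
import spark.implicits._

// Assumed toy input: one sparse vector with a single non-zero entry.
val df = Seq(Tuple1(Vectors.sparse(3, Array(1), Array(2.5)))).toDF("features")
val zipKeyValue = udf((vec: Vector) => vec.toSparse.indices.zip(vec.toSparse.values))
val df2 = df.withColumn("features", explode(zipKeyValue(col("features"))))

// Struct columns arrive in a UDF as Row, hence the original ClassCastException with Tuple2.
val udf2 = udf((f: Row) => f.getInt(0).toString)
val udf3 = udf((f: Row) => f.getDouble(1))
val df3 = df2
  .withColumn("key", udf2(col("features")))
  .withColumn("value", udf3(col("features")))
df3.show()
```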