I want to explode a features column (type: ml.linalg sparse vector) into the index and value of each feature, so I do the following:
def zipKeyValue(vec: linalg.Vector): Array[(Int, Double)] = {
  val indices: Array[Int] = vec.toSparse.indices
  val values: Array[Double] = vec.toSparse.values
  indices.zip(values)
}
val udf1 = udf(zipKeyValue _)
val df1 = df.withColumn("features", udf1(col("features")))
val df2 = df1.withColumn("features", explode(col("features")))
val udf2 = udf((f: Tuple2[Int, Double]) => f._1.toString)
val udf3 = udf((f: Tuple2[Int, Double]) => f._2)
val df3 = df2.withColumn("key", udf2(col("features"))).withColumn("value", udf3(col("features")))
df3.show()
But I get the error:
Failed to execute user defined function(anonfun$38: (struct<_1:int,_2:double>) => string)
Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to scala.Tuple2
This is confusing to me, because my function zipKeyValue returns Tuple2[(Int,Double)], yet what I actually get is struct<_1:int,_2:double>. How can I fix this?
Answer 0 (score: 0)
You don't need UDFs here. Just select the struct's fields as columns:
df2
.withColumn("key", col("features._1"))
.withColumn("value", col("features._2"))
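For example, a minimal end-to-end sketch, assuming a local SparkSession and a hypothetical one-row toy DataFrame (the column names match the question; the vector contents are made up for illustration):

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, udf}

val spark = SparkSession.builder().master("local[1]").appName("explode-features").getOrCreate()
import spark.implicits._

// Toy sparse-vector column mirroring the question's input (assumed data).
val df = Seq(Tuple1(Vectors.sparse(4, Array(0, 2), Array(1.0, 3.0)))).toDF("features")

// Zip indices with values; Spark encodes Array[(Int, Double)] as array<struct<_1:int,_2:double>>.
val zipKeyValue = udf((vec: Vector) => vec.toSparse.indices.zip(vec.toSparse.values))
val df2 = df.withColumn("features", explode(zipKeyValue(col("features"))))

// No UDF needed: select the struct fields directly.
val df3 = df2
  .withColumn("key", col("features._1"))
  .withColumn("value", col("features._2"))
df3.show()
```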
If you do keep the UDFs, the key is that Spark passes a struct to a Scala UDF as a Row, not a Tuple2, so in general you should use Rows instead of Tuples:
import org.apache.spark.sql.Row
val udf2 = udf((f: Row) => f.getInt(0).toString)
val udf3 = udf((f: Row) => f.getDouble(1))
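A self-contained sketch of the Row-based variant, again assuming a local SparkSession and made-up toy data so the snippet stands alone:

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, explode, udf}

val spark = SparkSession.builder().master("local[1]").appName("row-udf").getOrCreate()
import spark.implicits._

// Assumed toy input: one sparse vector with a single non-zero entry.
val df = Seq(Tuple1(Vectors.sparse(3, Array(1), Array(2.5)))).toDF("features")
val zipKeyValue = udf((vec: Vector) => vec.toSparse.indices.zip(vec.toSparse.values))
val df2 = df.withColumn("features", explode(zipKeyValue(col("features"))))

// Struct columns arrive in a UDF as Row, hence the original ClassCastException with Tuple2.
val udf2 = udf((f: Row) => f.getInt(0).toString)
val udf3 = udf((f: Row) => f.getDouble(1))
val df3 = df2
  .withColumn("key", udf2(col("features")))
  .withColumn("value", udf3(col("features")))
df3.show()
```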