I have a Scala Spark DataFrame (variable df):
id, values
"a", [0.5, 0.6]
"b", [0.1, 0.2]
...
I am trying to use RowMatrix to compute pairwise cosine similarities efficiently.
final case class dataRow(id: String, values: Array[Double])
val rows = df.as[dataRow].map {
  row => {
    Vectors.dense(row.values)
  }
}.rdd
I get the following compile error:
Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._
Ultimately, I want to be able to do this (RowMatrix needs an RDD[Vector]):
val mat = new RowMatrix(rows)
I have already imported spark.implicits._, so what am I doing wrong?
Answer 0: (score: 1)
There is simply no implicit encoder for the Vector type. So either do the map on the rdd:
val rows = df.as[dataRow].rdd.map(row => Vectors.dense(row.values))
or provide an Encoder explicitly:
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
df.as[dataRow].map(x => Vectors.dense(x.values))(ExpressionEncoder(): Encoder[Vector])
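For reference, the first option can be sketched end to end as below. This is only a minimal sketch under a few assumptions: Spark SQL and MLlib are on the classpath, a local session is acceptable for trying it out, and the names `DataRow`, `RowMatrixSketch` and `toRowMatrix` are made up for illustration. It keeps the case class at the top level and switches to the RDD before mapping, so the compiler never has to look for an `Encoder[Vector]`.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical name; declared at the top level so an encoder can be derived for it.
final case class DataRow(id: String, values: Array[Double])

object RowMatrixSketch {
  // Builds a RowMatrix from a DataFrame with columns (id: String, values: Array[Double]).
  def toRowMatrix(spark: SparkSession, df: DataFrame): RowMatrix = {
    import spark.implicits._ // supplies the encoder for DataRow
    // RDD.map only needs a ClassTag, so no Encoder[Vector] is involved here.
    val rows: RDD[Vector] = df.as[DataRow].rdd.map(row => Vectors.dense(row.values))
    new RowMatrix(rows)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are arbitrary.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("row-matrix-sketch")
      .getOrCreate()
    import spark.implicits._

    // Sample rows taken from the question.
    val df = Seq(("a", Array(0.5, 0.6)), ("b", Array(0.1, 0.2))).toDF("id", "values")
    val mat = toRowMatrix(spark, df)
    println(s"rows = ${mat.numRows()}, cols = ${mat.numCols()}")
    spark.stop()
  }
}
```

The id column is dropped in this conversion, so if the ids are needed later they have to be carried separately (for example via an index).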
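Answer 1: (score: 0)
Which Vectors object are you using?
Try importing it from the linalg package explicitly. There may be a conflict between libraries.
Also move the case class out of the function scope, and drop final: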
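import org.apache.spark.mllib.linalg.{Vectors, Vector}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.sql.{DataFrame, SparkSession}

// Defined outside any method so that an encoder can be derived for it.
case class DataRow(id: String, values: Array[Double])

def func(spark: SparkSession, df: DataFrame): RowMatrix = {
  import spark.implicits._
  // Switch to the RDD before mapping: RDD.map needs no Encoder[Vector].
  val rows = df.as[DataRow]
    .rdd
    .map(row => Vectors.dense(row.values))
  new RowMatrix(rows)
}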
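One caveat for the stated goal of pairwise similarities between rows: `RowMatrix.columnSimilarities()` compares columns, not rows. A possible way around this, sketched below, is to transpose first so that each original row becomes a column. The object name `RowSimilaritySketch`, the helper `rowSimilarities` and the numeric row indices standing in for the string ids are all assumptions made for this illustration, not something taken from the question.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, IndexedRow, IndexedRowMatrix}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object RowSimilaritySketch {
  // rows: (row index, feature values). The index stands in for the id column.
  def rowSimilarities(rows: RDD[(Long, Array[Double])]): CoordinateMatrix = {
    val indexed = new IndexedRowMatrix(
      rows.map { case (index, values) => IndexedRow(index, Vectors.dense(values)) }
    )
    // After the transpose, column j holds original row j,
    // so column similarities are similarities between the original rows.
    indexed.toCoordinateMatrix().transpose().toRowMatrix().columnSimilarities()
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("row-similarity-sketch")
      .getOrCreate()

    // Sample rows from the question, with "a" -> 0 and "b" -> 1.
    val rows = spark.sparkContext.parallelize(
      Seq((0L, Array(0.5, 0.6)), (1L, Array(0.1, 0.2)))
    )
    // Each entry is MatrixEntry(i, j, cosine similarity of rows i and j).
    rowSimilarities(rows).entries.collect().foreach(println)
    spark.stop()
  }
}
```

Because the transposed matrix has one column per original row, this is probably only practical when the number of rows is moderate, since RowMatrix is designed for matrices whose column count is not huge.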