How to convert a Spark DataFrame into rows of sparse vectors to create a RowMatrix object

Asked: 2018-12-29 02:10:35

Tags: apache-spark apache-spark-mllib

The Spark SVD example code is as follows:

import org.apache.spark.mllib.linalg.{Matrix, SingularValueDecomposition, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val data = Array(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0))

val rows = sc.parallelize(data)
val mat: RowMatrix = new RowMatrix(rows)

// Compute the top 5 singular values and the corresponding singular vectors.
val svd: SingularValueDecomposition[RowMatrix, Matrix] = mat.computeSVD(5, computeU = true)
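
For reference, once computeSVD returns, the factors can be read back like this (a short sketch following the mllib API):

val U: RowMatrix = svd.U  // distributed left singular vectors (present because computeU = true)
val s = svd.s             // singular values, a local dense Vector
val V: Matrix = svd.V     // local dense matrix of right singular vectors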

Now the question is how to build that data RDD of sparse vectors so it can be fed to the SVD function. My data looks like the following; each record is a sparse vector (Vectors.sparse) serialized as JSON:

{"type":0,"size":205209,"indices":[24119,32380,201090],"values": 
[1.8138314440983385,1.6036455249478836,1.3787660101958308]}
{"type":0,"size":205209,"indices":[24119,32380,176747,201090],"values": 
[5.441494332295015,3.207291049895767,3.2043056252302478,2.7575320203916616]}
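
This appears to be exactly the format produced by mllib's Vector.toJson, where "type":0 marks a sparse vector, so a single record should round-trip through Vectors.fromJson. A minimal sketch of parsing one line:

import org.apache.spark.mllib.linalg.Vectors

// Parse one JSON record back into a SparseVector (size 205209, 3 non-zeros)
val json = """{"type":0,"size":205209,"indices":[24119,32380,201090],"values":[1.8138314440983385,1.6036455249478836,1.3787660101958308]}"""
val v = Vectors.fromJson(json)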

So far I have tried:

val rows = df_raw_features.select("raw_features").rdd.map(Vectors.sparse).map(Row(_))

and I get this error:

[error] /home/lujunchen/project/spark_code/src/main/scala/svd_feature_engineer.scala:39:71: type mismatch;
[error]  found   : (size: Int, elements: Iterable[(Integer, Double)])org.apache.spark.mllib.linalg.Vector <and> (size: Int, elements: Seq[(Int, Double)])org.apache.spark.mllib.linalg.Vector <and> (size: Int, indices: Array[Int], values: Array[Double])org.apache.spark.mllib.linalg.Vector
[error]  required: org.apache.spark.sql.Row => ?
[error]     val rows = df_raw_features.select("raw_features").rdd.map(Vectors.sparse).map(Row(_))
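
The mismatch comes from passing Vectors.sparse, an overloaded method that expects (size, indices, values), where rdd.map needs a Row => Vector function. Assuming raw_features is a string column holding the JSON records shown above, one possible fix is to parse each row with Vectors.fromJson (a sketch, not tested against the original schema):

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Parse each JSON string into an mllib Vector, then wrap the RDD in a RowMatrix
val rows = df_raw_features.select("raw_features").rdd
  .map(row => Vectors.fromJson(row.getString(0)))
val mat = new RowMatrix(rows)
val svd = mat.computeSVD(5, computeU = true)

If raw_features is instead already a Spark ML vector column (VectorUDT), the conversion would be Vectors.fromML(row.getAs[org.apache.spark.ml.linalg.Vector](0)) rather than JSON parsing.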

0 Answers:

No answers yet.