Creating SparseVectors from the elements of an RDD

Time: 2016-09-14 19:40:35

Tags: scala apache-spark rdd scala-collections

Using Spark, I have a data structure of type RDD[((Int, Int), Double)] in Scala (val rdd below), where each element of the RDD represents one entry of a matrix: x is the row, y is the column, and cov is the value of the entry.

I need to create SparseVectors from the rows of this matrix. So I decided to first convert rdd to RDD[(Int, (Int, Double))] and then use groupByKey to bring together all the elements of a given row, like this:

val rdd2 = rdd.map{case ((x,y),cov) => (x, (y, cov))}.groupByKey()
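To see what this reshaping produces, the same step can be sketched on plain Scala collections (a local analogue of the RDD transformation; the sample matrix entries are made up):

```scala
// Local analogue of rdd.map{ case ((x,y),cov) => (x,(y,cov)) }.groupByKey()
// on a few sample matrix entries ((row, col), value).
val entries = Seq(((0, 1), 2.0), ((0, 3), 4.0), ((1, 2), 6.0))

val grouped: Map[Int, Seq[(Int, Double)]] =
  entries
    .map { case ((x, y), cov) => (x, (y, cov)) } // key each entry by its row
    .groupBy(_._1)                               // like groupByKey: one group per row
    .map { case (x, kvs) => (x, kvs.map(_._2)) } // keep only the (col, value) pairs
```

Each key of grouped is a row index, and its value holds all (column, value) pairs of that row, which is exactly the shape the SparseVector step below consumes.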

Now I need to create the SparseVectors:

val N = 7     // vector size
val spvec = { (x: Int, y: Iterable[(Int, Double)]) =>
  new SparseVector(N.toLong, Array(y.map(el => el._1.toInt)), Array(y.map(el => el._2.toDouble)))
}
val vecs = rdd2.map(spvec)

However, this produces the following errors:

type mismatch; found: Iterable[Int] required: Int
type mismatch; found: Iterable[Double] required: Double

I am guessing that y.map(el => el._1.toInt) returns an Iterable, which cannot be used with Array(...) like this. I would appreciate any help with solving this.
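For reference, the direct fix to the snippet above is a small one (a sketch, assuming spark-mllib on the classpath and the rdd2 produced by the groupByKey step): materialize the mapped Iterables with .toArray instead of wrapping them in Array(...), and pass the size as an Int rather than a Long:

```scala
import org.apache.spark.mllib.linalg.SparseVector

val N = 7 // vector size

// Takes one grouped row: (rowIndex, all (column, value) pairs of that row).
// .toArray turns the Iterable produced by map into the Array[Int] /
// Array[Double] that the SparseVector constructor expects; the first
// constructor argument must be an Int.
val spvec = { (row: (Int, Iterable[(Int, Double)])) =>
  val (x, pairs) = row
  val sorted = pairs.toArray.sortBy(_._1) // SparseVector expects ascending indices
  new SparseVector(N, sorted.map(_._1), sorted.map(_._2))
}

// val vecs = rdd2.map(spvec)
```

The sort is there because groupByKey gives no guarantee that the (column, value) pairs arrive ordered by column, while SparseVector assumes its indices are in increasing order.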

1 Answer:

Answer 0 (score: 0)

The simplest solution is to convert to a RowMatrix:

import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

val rdd: RDD[((Int, Int), Double)] = ???

val vs: RDD[org.apache.spark.mllib.linalg.SparseVector] =
  new CoordinateMatrix(
    rdd.map { case ((x, y), cov) => MatrixEntry(x, y, cov) }
  ).toRowMatrix.rows.map(_.toSparse)

If you want to keep the row indices, you can use toIndexedRowMatrix instead.
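A self-contained sketch of the toIndexedRowMatrix variant (assuming spark-mllib on the classpath and a local SparkContext; the sample matrix entries are made up):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, IndexedRow, MatrixEntry}

val sc = new SparkContext(
  new SparkConf().setMaster("local[1]").setAppName("sparse-rows"))

// Sample ((row, col), value) entries of a small matrix.
val rdd = sc.parallelize(Seq(((0, 1), 2.0), ((1, 3), 4.0)))

// toIndexedRowMatrix keeps the row index alongside each row vector,
// so the result pairs every sparse row with its original row number.
val indexed = new CoordinateMatrix(
  rdd.map { case ((x, y), cov) => MatrixEntry(x, y, cov) }
).toIndexedRowMatrix.rows.map {
  case IndexedRow(i, v) => (i, v.toSparse)
}
```

Note that IndexedRow carries the index as a Long, so the row keys come back as Long rather than the original Int.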