I have an RDD of (Long, Vector) pairs and I want to sum all the vectors. How can I do this in Spark 1.6?
For example, given input data like
(1,[0.1,0.2,0.7])
(2,[0.2,0.4,0.4])
I want to produce a result like [0.3,0.6,1.1], ignoring the Long keys.
Answer 0 (score: 3)
If you have an RDD[(Long, Vector)] like this:
import org.apache.spark.mllib.linalg.{Vector, Vectors}
val myRdd = sc.parallelize(List((1L, Vectors.dense(0.1, 0.2, 0.7)), (2L, Vectors.dense(0.2, 0.4, 0.4))))
You can reduce over the values (the vectors) to get their sum:
// drop the keys, then add the vectors element by element
val res = myRdd
  .values
  .reduce { (a: Vector, b: Vector) =>
    Vectors.dense((a.toArray, b.toArray).zipped.map(_ + _))
  }
I get the following result, which includes the usual floating-point rounding error:
[0.30000000000000004,0.6000000000000001,1.1]
Source: this
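The reduce above allocates a new array and a new vector for every pair it combines. As a rough sketch of an alternative, the same sum can be written with `aggregate`, which lets each partition add into one reusable array. This is not part of the original answer; it assumes that `sc` is the SparkContext from spark-shell, that all vectors have the same length, and the names `dim`, `sumArray` and `total` are only illustrative.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Same sample data as in the answer; `sc` is assumed to come from spark-shell.
val myRdd = sc.parallelize(List(
  (1L, Vectors.dense(0.1, 0.2, 0.7)),
  (2L, Vectors.dense(0.2, 0.4, 0.4))))

// Assumption: every vector has the same size, read here from the first element.
val dim = myRdd.first()._2.size

val sumArray = myRdd.values.aggregate(new Array[Double](dim))(
  // seqOp: add one vector into the partition-local accumulator in place
  (acc, v) => {
    val arr = v.toArray
    var i = 0
    while (i < arr.length) { acc(i) += arr(i); i += 1 }
    acc
  },
  // combOp: merge the accumulators of two partitions
  (a, b) => {
    var i = 0
    while (i < a.length) { a(i) += b(i); i += 1 }
    a
  })

// Wrap the summed array back into an MLlib vector.
val total: Vector = Vectors.dense(sumArray)
```

The values are still added as doubles, so the same kind of rounding error as in the result above should be expected. For RDDs with many partitions, `treeAggregate` takes the same two functions and may reduce the load on the driver.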
Answer 1 (score: 0)
You can refer to the Spark example, in particular this part:
// Note: Vectors.fromML is only available from Spark 2.0 onward, not in Spark 1.6.
val model = pipeline.fit(df)
val documents = model.transform(df)
  .select("features")
  .rdd
  .map { case Row(features: MLVector) => Vectors.fromML(features) }
  .zipWithIndex()
  .map(_.swap)
(documents,
  model.stages(2).asInstanceOf[CountVectorizerModel].vocabulary, // vocabulary
  documents.map(_._2.numActives).sum().toLong) // total token count