I have an RDD of the form RDD[((ID, code), value)].
Example RDD:
((00001, 234) 7.0)
((00001, 456) 6.0)
((00001, 467) 3.0)
((00002, 245) 8.0)
((00002, 765) 9.0)
...
Expected result: RDD[(String, Vectors.dense(...))]
Example:
(00001, vector(7.0, 6.0, 3.0))
(00002, vector(8.0, 9.0))
I tried the following:
val vectRDD = InRDD.groupBy(f => f._1._1)
.map(m => (m._1, Vectors.dense(m._2._2)))
But this results in an error.
Any suggestions?
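Answer 0 (score: 2)
You're almost there. The only thing missing is an inner map over the second tuple element, which extracts the values needed to assemble the DenseVector: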
import org.apache.spark.ml.linalg.Vectors
val rdd = sc.parallelize(Seq(
(("00001", 234), 7.0),
(("00001", 456), 6.0),
(("00001", 467), 3.0),
(("00002", 245), 8.0),
(("00002", 765), 9.0)
))
rdd.
groupBy(_._1._1).
map(t => (t._1, Vectors.dense(t._2.map(_._2).toArray))).
collect
// res1: Array[(String, org.apache.spark.ml.linalg.Vector)] =
// Array((00001,[7.0,6.0,3.0]), (00002,[8.0,9.0]))
Note that Vectors.dense takes an Array[Double], which is why toArray is applied to the mapped values.
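
To see why the inner map is needed, the shape of the grouped data can be reproduced with plain Scala collections. This is only a sketch: a Seq stands in for the RDD (an assumption made so that it runs without Spark), and the names records, grouped and valuesById are hypothetical. As with RDD.groupBy, each group still contains the full ((ID, code), value) tuples, so the Double has to be pulled out of every element before an array can be built.

```scala
// A Seq stands in for the RDD here (assumption for illustration; no Spark needed).
val records: Seq[((String, Int), Double)] = Seq(
  (("00001", 234), 7.0),
  (("00001", 456), 6.0),
  (("00001", 467), 3.0),
  (("00002", 245), 8.0),
  (("00002", 765), 9.0)
)

// Grouping by ID keeps the whole tuples in each group:
// Map[String, Seq[((String, Int), Double)]]
val grouped = records.groupBy(_._1._1)

// The inner map extracts the Double from every tuple in the group;
// toArray then yields the Array[Double] that Vectors.dense expects.
val valuesById: Map[String, Array[Double]] =
  grouped.map { case (id, rows) => (id, rows.map(_._2).toArray) }
```

In the Spark version above, the same rows.map(_._2).toArray expression is what gets passed to Vectors.dense.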