I have an RDD like this:
RDD[(Vector, Int)]
Example: [0.0,0.0,0.0,0.0,0.21052631578947367,0.7894736842105263,0.0,0.0],97
What I want to get is, for each key (here the key is 97), the maximum value of each dimension of my vector.
By dimension I mean:
[0.0 , 0.0 , 0.0 , 0.0 , 0.21052631578947367 , 0.7894736842105263 , 0.0 , 0.0]
^ ^ ^ ^ ^ ^ ^ ^
Dim1 , Dim2 ,Dim3, Dim4, Dim5 , Dim6 , Dim7 , Dim8
So basically I want, across the whole RDD, the maximum value for each key and each single dimension...
In fact I am trying to pass numDimension as a parameter, but I can't use it like this:
def getMaxValue(data: RDD[DBSCANLabeledPoint], numDimension: Int): RDD[(Int)] = {
  data.map(p => (p.${numDimension}, p.cluster)).reduceByKey(math.max(_, _))
}
Can anyone help me?
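For the single-dimension variant the question asks about, the fix is to index into the vector rather than splice the dimension into a field name. A minimal sketch, with plain Scala collections standing in for the RDD and a hypothetical case class standing in for DBSCANLabeledPoint (its real fields are assumed here; only cluster appears in the question):

```scala
// Hypothetical stand-in for DBSCANLabeledPoint: a vector of doubles plus a cluster id.
case class LabeledPoint(vector: Array[Double], cluster: Int)

// Maximum of one chosen dimension per cluster. On a real RDD the same
// map step would be followed by reduceByKey(math.max(_, _)) instead of
// the groupBy/map used here on local collections.
def maxOfDimension(data: Seq[LabeledPoint], numDimension: Int): Map[Int, Double] =
  data
    .map(p => (p.cluster, p.vector(numDimension))) // index the vector, don't build a field name
    .groupBy(_._1)
    .map { case (key, pairs) => (key, pairs.map(_._2).max) }
```

The key point is `p.vector(numDimension)`: mllib vectors (and arrays) are indexable by position, so the dimension can be an ordinary Int parameter.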
Answer 0 (score: 2)
Suppose we have vectors : RDD[(Vector, Int)] (where Vector is org.apache.spark.mllib.linalg.Vector), containing many (vector, Int) pairs such as:
[0.1,0.0,0.0,0.0,0.21052631578947367,0.7894736842105263,0.0,0.0],97
[0.0,0.3,0.0,0.0,0.21052631578947367,0.7894736842105263,0.0,0.0],97
[0.0,0.0,0.0,0.0,0.21052631578947367,0.7894736842105263,0.0,0.0],99
[0.0,0.0,0.0,0.0,0.21052631578947367,0.7894736842105263,0.0,0.0],96
Here is what I would do:
val result: RDD[(Int, Vector)] = vectors
  .map(tuple => (tuple._2, tuple._1))
  .reduceByKey((left, right) =>
    Vectors.dense(
      left.toArray.zip(right.toArray)
        .map(pair => pair._1.max(pair._2))
    )
  )
Here is what the code does:
The map step swaps each tuple so the Int key comes first, which is the shape reduceByKey expects.
reduceByKey takes a function that reduces two elements into a single one, and that function does two things here. First, it zips the two vectors into a single sequence of pairs, keeping the values from both vectors. Then we map over those pairs, turning each (Double, Double) pair into a single Double by replacing it with the maximum of the two. Since org.apache.spark.mllib.linalg.Vector does not support zip(), we convert each vector to an Array[Double], zip and merge the arrays, and wrap the merged array back into a Vector with Vectors.dense. After running the code above, result will hold the following values:
(97, [0.1,0.3,0.0,0.0,0.21052631578947367,0.7894736842105263,0.0,0.0])
(99, [0.0,0.0,0.0,0.0,0.21052631578947367,0.7894736842105263,0.0,0.0])
(96, [0.0,0.0,0.0,0.0,0.21052631578947367,0.7894736842105263,0.0,0.0])
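The whole pipeline can be sketched without a Spark cluster: plain Scala collections stand in for the RDD (groupBy plus reduce plays the role of reduceByKey), and a plain Array[Double] stands in for mllib's Vector:

```scala
// Elementwise maximum of two equal-length vectors: this is exactly the
// merge function passed to reduceByKey in the answer above.
def dimMax(left: Array[Double], right: Array[Double]): Array[Double] =
  left.zip(right).map { case (l, r) => math.max(l, r) }

// Per-key, per-dimension maximum. On a real RDD, the groupBy + reduce
// below collapses into a single reduceByKey(dimMax) call.
def maxPerKey(data: Seq[(Array[Double], Int)]): Map[Int, Array[Double]] =
  data
    .map { case (vector, key) => (key, vector) }
    .groupBy(_._1)
    .map { case (key, rows) => (key, rows.map(_._2).reduce(dimMax)) }
```

For example, maxPerKey(Seq((Array(0.1, 0.0), 97), (Array(0.0, 0.3), 97))) yields a map where key 97 is paired with the vector [0.1, 0.3].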