Get the max of an RDD per key and per dimension

Date: 2017-04-14 08:25:46

Tags: scala apache-spark max

I have an RDD like this:

RDD[(Vector, Int)]
example: [0.0,0.0,0.0,0.0,0.21052631578947367,0.7894736842105263,0.0,0.0],97

What I would like to get is the maximum value for each key (for example, the key 97) and for each dimension of my vector.

By dimension I mean:

[0.0 , 0.0 , 0.0 , 0.0 , 0.21052631578947367 , 0.7894736842105263 , 0.0 , 0.0]
  ^     ^     ^     ^         ^                       ^              ^     ^
Dim1 , Dim2 ,Dim3, Dim4,      Dim5           ,        Dim6        , Dim7 , Dim8

So basically I want the maximum value for each key and for each single dimension across the whole RDD...

In fact I am trying to take numDimension as a parameter, but I can't use it like this:

def getMaxValue(data: RDD[DBSCANLabeledPoint], numDimension: Int): RDD[(Int)] = {
  data.map(p => (p.${numDimension}, p.cluster)).reduceByKey(math.max(_, _))
}
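A minimal sketch of what this was presumably meant to do would index into the point's vector instead of interpolating a field name (Scala has no p.${numDimension} syntax). It assumes DBSCANLabeledPoint exposes its coordinates as a Vector field; the field name vector below is an assumption, while cluster is taken from the snippet above:

import org.apache.spark.rdd.RDD

// Sketch only: `vector` is an assumed field holding the point's coordinates,
// and DBSCANLabeledPoint is assumed to be in scope.
def getMaxValue(data: RDD[DBSCANLabeledPoint], numDimension: Int): RDD[(Int, Double)] =
  data
    .map(p => (p.cluster, p.vector(numDimension))) // key by cluster, value = coordinate at numDimension
    .reduceByKey(math.max(_, _))                   // max of that coordinate per cluster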

Can anyone help me?

1 Answer:

Answer 0 (score: 2)

Suppose we have vectors : RDD[(Vector, Int)] (where Vector is org.apache.spark.mllib.linalg.Vector), containing many (vector[float], Int) pairs, such as:

[0.1,0.0,0.0,0.0,0.21052631578947367,0.7894736842105263,0.0,0.0],97
[0.0,0.3,0.0,0.0,0.21052631578947367,0.7894736842105263,0.0,0.0],97
[0.0,0.0,0.0,0.0,0.21052631578947367,0.7894736842105263,0.0,0.0],99
[0.0,0.0,0.0,0.0,0.21052631578947367,0.7894736842105263,0.0,0.0],96

Here is what I would do:

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

val result: RDD[(Int, Vector)] = vectors
  .map(tuple => (tuple._2, tuple._1))        // swap so that the Int key comes first
  .reduceByKey((left, right) =>              // merge all vectors sharing a key
    Vectors.dense(
      left.toArray.zip(right.toArray)        // pair up the coordinates of both vectors
        .map(pair => pair._1.max(pair._2))   // keep the larger value of each pair
    )
  )

Here is what the code does:

  1. map - swaps the key and the value so that we can use reduceByKey.
  2. reduceByKey - reduces all items that share the same key, using the provided function.
  3. the inner function - this function is given to reduceByKey to reduce two elements into a single one. It does two things. First, it zips the two vectors together into a single collection of pairs, keeping the values of both. Then we map over that collection, turning the Array[(Double, Double)] into an Array[Double] by replacing each pair with the maximum of its two values.
  4. dense - since org.apache.spark.mllib.linalg.Vector does not support zip(), we convert each vector to an Array[Double] to do the zipping, and rebuild a Vector with Vectors.dense once we are done merging the arrays.
  5. So after running the code above, result will hold the following values (the elementwise max per key):

    (97,[0.1,0.3,0.0,0.0,0.21052631578947367,0.7894736842105263,0.0,0.0])
    (99,[0.0,0.0,0.0,0.0,0.21052631578947367,0.7894736842105263,0.0,0.0])
    (96,[0.0,0.0,0.0,0.0,0.21052631578947367,0.7894736842105263,0.0,0.0])
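For completeness, here is one way to try the whole thing end to end, assuming a live SparkContext named sc (the sample rows match the pairs shown above):

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Build the sample RDD from the four pairs shown earlier.
val vectors: RDD[(Vector, Int)] = sc.parallelize(Seq(
  (Vectors.dense(0.1, 0.0, 0.0, 0.0, 0.21052631578947367, 0.7894736842105263, 0.0, 0.0), 97),
  (Vectors.dense(0.0, 0.3, 0.0, 0.0, 0.21052631578947367, 0.7894736842105263, 0.0, 0.0), 97),
  (Vectors.dense(0.0, 0.0, 0.0, 0.0, 0.21052631578947367, 0.7894736842105263, 0.0, 0.0), 99),
  (Vectors.dense(0.0, 0.0, 0.0, 0.0, 0.21052631578947367, 0.7894736842105263, 0.0, 0.0), 96)
))

// `result` is computed exactly as in the answer above; printing it
// yields one (key, max-vector) line per key.
result.collect().foreach(println)

This works because the elementwise max is associative and commutative, so reduceByKey can merge the vectors pairwise in any order.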