
Time: 2018-01-09 01:23:36

Tags: apache-spark rdd

I am trying to group an RDD using groupBy. Most of the documentation advises against groupBy because of how it works internally to group the keys (all values for a key are shuffled to one place). Is there another way to achieve the same result? I cannot use reduceByKey, because I am not performing a reduce operation here.

Example:

// Entry has two fields: long id, String name
// (pseudocode for the intended pipeline)
JavaRDD<Entry> entries = rdd.groupBy(Entry::getId)
                            .flatMap(x -> someOp(x))
                            .values()
                            .filter()

1 Answer:

Answer 0 (score: 2):

aggregateByKey [Pair]

Works like the aggregate function, except the aggregation is applied to the values with the same key. Also, unlike the aggregate function, the initial value is not applied to the second reduce (combOp).

Listing Variants

def aggregateByKey[U](zeroValue: U)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]

def aggregateByKey[U](zeroValue: U, numPartitions: Int)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]

def aggregateByKey[U](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]

Example:

val pairRDD = sc.parallelize(List( ("cat",2), ("cat", 5), ("mouse", 4),("cat", 12), ("dog", 12), ("mouse", 2)), 2)

// let's have a look at what is in the partitions
def myfunc(index: Int, iter: Iterator[(String, Int)]) : Iterator[String] = {
  iter.map(x => "[partID:" +  index + ", val: " + x + "]")
}
pairRDD.mapPartitionsWithIndex(myfunc).collect

res2: Array[String] = Array([partID:0, val: (cat,2)], [partID:0, val: (cat,5)], [partID:0, val: (mouse,4)], [partID:1, val: (cat,12)], [partID:1, val: (dog,12)], [partID:1, val: (mouse,2)])

pairRDD.aggregateByKey(0)(math.max(_, _), _ + _).collect
res3: Array[(String, Int)] = Array((dog,12), (cat,17), (mouse,6))

pairRDD.aggregateByKey(100)(math.max(_, _), _ + _).collect
res4: Array[(String, Int)] = Array((dog,100), (cat,200), (mouse,200))
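
Tying this back to the question: because the result type U of aggregateByKey does not have to match the value type, it can also simply collect all the values for each key, which gives a groupBy-style result without needing a reduce. The following is a minimal, hedged Java sketch of that pattern (not code from the original answer), assuming the Entry class and the rdd variable from the question, with getId() returning the long id:

import java.util.ArrayList;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// Assumed from the question: an existing JavaRDD<Entry> named rdd.
JavaPairRDD<Long, Entry> byId =
    rdd.mapToPair(e -> new Tuple2<>(e.getId(), e));           // key each Entry by its id

JavaPairRDD<Long, ArrayList<Entry>> grouped =
    byId.aggregateByKey(
        new ArrayList<Entry>(),                                // zero value: empty list per key
        (list, entry) -> { list.add(entry); return list; },    // seqOp: accumulate within a partition
        (left, right) -> { left.addAll(right); return left; }  // combOp: merge per-partition lists
    );
// grouped holds every Entry per id, without calling groupBy / groupByKey.

Note that the order of entries inside each list is not guaranteed, and all values for a key are still shuffled (unlike a true reduce, there is nothing to pre-combine except the per-partition lists themselves).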