I am trying to group an RDD by key. Most of the documentation advises against using groupBy because it shuffles all values for each key across the cluster. Is there another way to achieve this? I can't use reduceByKey because I'm not performing a reduce operation here.
Example -

Entry - long id, string name;

JavaRDD<Entry> entries = rdd.groupBy(Entry::getId)
                            .flatMap(x -> someOp(x))
                            .values()
                            .filter(...)
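To make the mechanics concrete, here is a plain-Java sketch (no Spark) of the idea behind replacing groupBy with an aggregateByKey-style fold: within each "partition" a seqOp appends values to a per-key buffer, and a combOp then concatenates the per-partition buffers. All class and method names here are illustrative, not Spark API.

```java
import java.util.*;

public class GroupViaAggregate {

    // seqOp: fold one value into the per-key buffer
    static List<String> seqOp(List<String> buf, String v) {
        buf.add(v);
        return buf;
    }

    // combOp: merge two per-key buffers coming from different partitions
    static List<String> combOp(List<String> a, List<String> b) {
        a.addAll(b);
        return a;
    }

    static Map<Long, List<String>> aggregate(List<List<Map.Entry<Long, String>>> partitions) {
        // First pass: fold each partition independently with seqOp
        List<Map<Long, List<String>>> perPartition = new ArrayList<>();
        for (List<Map.Entry<Long, String>> part : partitions) {
            Map<Long, List<String>> acc = new HashMap<>();
            for (Map.Entry<Long, String> e : part) {
                List<String> buf = acc.computeIfAbsent(e.getKey(), k -> new ArrayList<>()); // zero value
                acc.put(e.getKey(), seqOp(buf, e.getValue()));
            }
            perPartition.add(acc);
        }
        // Second pass: merge the per-partition maps with combOp
        Map<Long, List<String>> merged = new HashMap<>();
        for (Map<Long, List<String>> m : perPartition) {
            for (Map.Entry<Long, List<String>> e : m.entrySet()) {
                merged.merge(e.getKey(), e.getValue(), GroupViaAggregate::combOp);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<Map.Entry<Long, String>>> partitions = List.of(
            List.of(Map.entry(1L, "a"), Map.entry(2L, "b")),
            List.of(Map.entry(1L, "c"))
        );
        System.out.println(aggregate(partitions));
    }
}
```

The key difference from groupBy is that the per-key buffers are built incrementally on each side before anything is merged, which is exactly what aggregateByKey (described in the answer below only in its Spark form) lets you express.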
Answer 0 (score: 2)
Similar to the aggregate function, except that the aggregation is applied to the values with the same key. Also unlike the aggregate function, the initial (zero) value is not applied in the second reduce (combOp).
Listing Variants
def aggregateByKey[U](zeroValue: U)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]
def aggregateByKey[U](zeroValue: U, numPartitions: Int)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]
def aggregateByKey[U](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]
Example:
val pairRDD = sc.parallelize(List( ("cat",2), ("cat", 5), ("mouse", 4),("cat", 12), ("dog", 12), ("mouse", 2)), 2)
// let's have a look at what is in the partitions
def myfunc(index: Int, iter: Iterator[(String, Int)]): Iterator[String] = {
  iter.map(x => "[partID:" + index + ", val: " + x + "]")
}
pairRDD.mapPartitionsWithIndex(myfunc).collect
res2: Array[String] = Array([partID:0, val: (cat,2)], [partID:0, val: (cat,5)], [partID:0, val: (mouse,4)], [partID:1, val: (cat,12)], [partID:1, val: (dog,12)], [partID:1, val: (mouse,2)])
pairRDD.aggregateByKey(0)(math.max(_, _), _ + _).collect
res3: Array[(String, Int)] = Array((dog,12), (cat,17), (mouse,6))
pairRDD.aggregateByKey(100)(math.max(_, _), _ + _).collect
res4: Array[(String, Int)] = Array((dog,100), (cat,200), (mouse,200))
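The second result illustrates the point made above: with zeroValue = 100, each partition's maximum for "cat" is clamped to 100 by seqOp, and combOp then adds the two per-partition results (100 + 100 = 200) without touching the zero value again. A plain-Java emulation of this semantics (a sketch, not Spark API; the method name aggregateByKey here is only borrowed for illustration) makes that visible:

```java
import java.util.*;
import java.util.function.IntBinaryOperator;

public class AggregateByKeySim {

    static Map<String, Integer> aggregateByKey(
            List<List<Map.Entry<String, Integer>>> partitions,
            int zero, IntBinaryOperator seqOp, IntBinaryOperator combOp) {
        // seqOp runs inside each partition, seeded once per key with the zero value
        List<Map<String, Integer>> perPartition = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> part : partitions) {
            Map<String, Integer> acc = new HashMap<>();
            for (Map.Entry<String, Integer> e : part) {
                int cur = acc.getOrDefault(e.getKey(), zero);
                acc.put(e.getKey(), seqOp.applyAsInt(cur, e.getValue()));
            }
            perPartition.add(acc);
        }
        // combOp merges the per-partition results; the zero value is never used here
        Map<String, Integer> merged = new HashMap<>();
        for (Map<String, Integer> m : perPartition) {
            for (Map.Entry<String, Integer> e : m.entrySet()) {
                merged.merge(e.getKey(), e.getValue(),
                             (a, b) -> combOp.applyAsInt(a, b));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Same data and partitioning as the pairRDD example above
        List<List<Map.Entry<String, Integer>>> partitions = List.of(
            List.of(Map.entry("cat", 2), Map.entry("cat", 5), Map.entry("mouse", 4)),
            List.of(Map.entry("cat", 12), Map.entry("dog", 12), Map.entry("mouse", 2))
        );
        System.out.println(aggregateByKey(partitions, 0, Math::max, Integer::sum));
        System.out.println(aggregateByKey(partitions, 100, Math::max, Integer::sum));
    }
}
```

Running it with zero = 0 reproduces (cat,17), (dog,12), (mouse,6), and with zero = 100 reproduces (cat,200), (dog,100), (mouse,200), matching res3 and res4.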