I am trying to group an RDD by key. Most of the documentation advises against using groupBy because it shuffles all values for each key across the cluster. Is there another way to achieve this? I can't use reduceByKey because I'm not performing a reduce operation here.
Example -

Entry - long id, string name;

JavaRDD<Entry> entries = rdd.groupBy(Entry::getId)
                            .flatMap(x -> someOp(x))
                            .values()
                            .filter(...)
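To make the mechanics concrete, here is a plain-Java sketch (no Spark) of the idea behind replacing groupBy with an aggregateByKey-style fold: within each "partition" a seqOp appends values to a per-key buffer, and a combOp then concatenates the per-partition buffers. All class and method names here are illustrative, not Spark API.

```java
import java.util.*;

public class GroupViaAggregate {

    // seqOp: fold one value into the per-key buffer
    static List<String> seqOp(List<String> buf, String v) {
        buf.add(v);
        return buf;
    }

    // combOp: merge two per-key buffers coming from different partitions
    static List<String> combOp(List<String> a, List<String> b) {
        a.addAll(b);
        return a;
    }

    static Map<Long, List<String>> aggregate(List<List<Map.Entry<Long, String>>> partitions) {
        // First pass: fold each partition independently with seqOp
        List<Map<Long, List<String>>> perPartition = new ArrayList<>();
        for (List<Map.Entry<Long, String>> part : partitions) {
            Map<Long, List<String>> acc = new HashMap<>();
            for (Map.Entry<Long, String> e : part) {
                List<String> buf = acc.computeIfAbsent(e.getKey(), k -> new ArrayList<>()); // zero value
                acc.put(e.getKey(), seqOp(buf, e.getValue()));
            }
            perPartition.add(acc);
        }
        // Second pass: merge the per-partition maps with combOp
        Map<Long, List<String>> merged = new HashMap<>();
        for (Map<Long, List<String>> m : perPartition) {
            for (Map.Entry<Long, List<String>> e : m.entrySet()) {
                merged.merge(e.getKey(), e.getValue(), GroupViaAggregate::combOp);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<Map.Entry<Long, String>>> partitions = List.of(
            List.of(Map.entry(1L, "a"), Map.entry(2L, "b")),
            List.of(Map.entry(1L, "c"))
        );
        System.out.println(aggregate(partitions));
    }
}
```

The key difference from groupBy is that the per-key buffers are built incrementally on each side before anything is merged, which is exactly what aggregateByKey (described in the answer below only in its Spark form) lets you express.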
Answer 0 (score: 2)
Similar to the aggregate function, except that the aggregation is applied to the values with the same key. Also unlike the aggregate function, the initial (zero) value is not applied in the second reduce (combOp).
Listing Variants
def aggregateByKey[U](zeroValue: U)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]
def aggregateByKey[U](zeroValue: U, numPartitions: Int)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]
def aggregateByKey[U](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]
Example:
val pairRDD = sc.parallelize(List( ("cat",2), ("cat", 5), ("mouse", 4),("cat", 12), ("dog", 12), ("mouse", 2)), 2)
// let's have a look at what is in the partitions
def myfunc(index: Int, iter: Iterator[(String, Int)]): Iterator[String] = {
  iter.map(x => "[partID:" + index + ", val: " + x + "]")
}
pairRDD.mapPartitionsWithIndex(myfunc).collect
res2: Array[String] = Array([partID:0, val: (cat,2)], [partID:0, val: (cat,5)], [partID:0, val: (mouse,4)], [partID:1, val: (cat,12)], [partID:1, val: (dog,12)], [partID:1, val: (mouse,2)])
pairRDD.aggregateByKey(0)(math.max(_, _), _ + _).collect
res3: Array[(String, Int)] = Array((dog,12), (cat,17), (mouse,6))
pairRDD.aggregateByKey(100)(math.max(_, _), _ + _).collect
res4: Array[(String, Int)] = Array((dog,100), (cat,200), (mouse,200))
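The second result illustrates the point made above: with zeroValue = 100, each partition's maximum for "cat" is clamped to 100 by seqOp, and combOp then adds the two per-partition results (100 + 100 = 200) without touching the zero value again. A plain-Java emulation of this semantics (a sketch, not Spark API; the method name aggregateByKey here is only borrowed for illustration) makes that visible:

```java
import java.util.*;
import java.util.function.IntBinaryOperator;

public class AggregateByKeySim {

    static Map<String, Integer> aggregateByKey(
            List<List<Map.Entry<String, Integer>>> partitions,
            int zero, IntBinaryOperator seqOp, IntBinaryOperator combOp) {
        // seqOp runs inside each partition, seeded once per key with the zero value
        List<Map<String, Integer>> perPartition = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> part : partitions) {
            Map<String, Integer> acc = new HashMap<>();
            for (Map.Entry<String, Integer> e : part) {
                int cur = acc.getOrDefault(e.getKey(), zero);
                acc.put(e.getKey(), seqOp.applyAsInt(cur, e.getValue()));
            }
            perPartition.add(acc);
        }
        // combOp merges the per-partition results; the zero value is never used here
        Map<String, Integer> merged = new HashMap<>();
        for (Map<String, Integer> m : perPartition) {
            for (Map.Entry<String, Integer> e : m.entrySet()) {
                merged.merge(e.getKey(), e.getValue(),
                             (a, b) -> combOp.applyAsInt(a, b));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Same data and partitioning as the pairRDD example above
        List<List<Map.Entry<String, Integer>>> partitions = List.of(
            List.of(Map.entry("cat", 2), Map.entry("cat", 5), Map.entry("mouse", 4)),
            List.of(Map.entry("cat", 12), Map.entry("dog", 12), Map.entry("mouse", 2))
        );
        System.out.println(aggregateByKey(partitions, 0, Math::max, Integer::sum));
        System.out.println(aggregateByKey(partitions, 100, Math::max, Integer::sum));
    }
}
```

Running it with zero = 0 reproduces (cat,17), (dog,12), (mouse,6), and with zero = 100 reproduces (cat,200), (dog,100), (mouse,200), matching res3 and res4.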