Using reduceByKey to group a list of values

Date: 2016-06-01 23:17:44

Tags: scala hadoop apache-spark mapreduce apache-spark-sql

I want to group the values for each key into a list, and I am currently doing the following:

sc.parallelize(Array(("red", "zero"), ("yellow", "one"), ("red", "two"))).groupByKey().collect.foreach(println)

(red,CompactBuffer(zero, two))
(yellow,CompactBuffer(one))

However, I noticed a blog post from Databricks that recommends avoiding groupByKey for large datasets:

Avoid GroupByKey

Is there a way to achieve the same result using reduceByKey?

I tried the following, but it concatenates all the values. By the way, in my case both the key and the value are of type String.

sc.parallelize(Array(("red", "zero"), ("yellow", "one"), ("red", "two"))).reduceByKey(_ ++ _).collect.foreach(println)

(red,zerotwo)
(yellow,one)

2 Answers:

Answer 0 (score: 1)

Use aggregateByKey:

import scala.collection.mutable.ListBuffer

sc.parallelize(Array(("red", "zero"), ("yellow", "one"), ("red", "two")))
  .aggregateByKey(ListBuffer.empty[String])(
    // seqOp: append each value to the per-partition buffer in place
    (numList, num) => { numList += num; numList },
    // combOp: merge the buffers coming from different partitions
    (numList1, numList2) => { numList1.appendAll(numList2); numList1 })
  .mapValues(_.toList)
  .collect()

Array[(String, List[String])] = Array((yellow,List(one)), (red,List(zero, two)))

For the rationale behind using aggregateByKey with a mutable collection such as ListBuffer, see this answer; for more background on aggregateByKey itself, see this link.
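
For comparison, here is a minimal sketch of the same aggregation written with an immutable List accumulator (this variant is my own illustration, not part of the answer above). Every seqOp and combOp step allocates a new list, which is exactly the overhead the mutable ListBuffer version avoids:

sc.parallelize(Array(("red", "zero"), ("yellow", "one"), ("red", "two")))
  .aggregateByKey(List.empty[String])(
    (acc, v) => acc :+ v,             // append the value, building a fresh list
    (left, right) => left ::: right)  // concatenate the per-partition lists
  .collect()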

Edit:

Is there a way to achieve the same result using reduceByKey?

The above actually performs worse than plain groupByKey for this use case; see @zero323's comments for details. Since every value has to be kept, there is no map-side reduction to gain, so building intermediate collections per partition only adds overhead.
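
For reference, a minimal sketch of the plain groupByKey baseline the comment refers to, with a mapValues(_.toList) step so the output shape matches the answers in this thread (this comparison is my own, not part of the original answer):

sc.parallelize(Array(("red", "zero"), ("yellow", "one"), ("red", "two")))
  .groupByKey()           // keeps every value per key; no map-side combine
  .mapValues(_.toList)    // Iterable[String] -> List[String]
  .collect()

// e.g. Array[(String, List[String])] = Array((yellow,List(one)), (red,List(zero, two)))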

Answer 1 (score: 0)

sc.parallelize(Array(("red", "zero"), ("yellow", "one"), ("red", "two")))
  .map(t => (t._1, List(t._2)))  // wrap each value in a single-element List
  .reduceByKey(_ ::: _)          // concatenate the lists for each key
  .collect()

Array[(String, List[String])] = Array((red,List(zero, two)), (yellow,List(one)))