Getting the output as (key, Iterable[values]) using combineByKey

Asked: 2017-10-19 05:33:14

Tags: scala hadoop apache-spark rdd

I am trying to convert an RDD of (key, value) into an RDD of (key, Iterable[value]), the same output that the groupByKey method returns. But since groupByKey is not efficient, I am trying to use combineByKey on the RDD instead; however, it does not work. Below is the code used:

val data = List("abc,2017-10-04,15.2",
          "abc,2017-10-03,19.67", 
          "abc,2017-10-02,19.8",
          "xyz,2017-10-09,46.9", 
          "xyz,2017-10-08,48.4",
          "xyz,2017-10-07,87.5", 
          "xyz,2017-10-04,83.03", 
          "xyz,2017-10-03,83.41",
          "pqr,2017-09-30,18.18", 
          "pqr,2017-09-27,18.2", 
          "pqr,2017-09-26,19.2", 
          "pqr,2017-09-25,19.47", 
          "abc,2017-07-19,96.60",
          "abc,2017-07-18,91.68", 
          "abc,2017-07-17,91.55")
val rdd = sc.parallelize(data)
val rows = rdd.map(line => {
  val row = line.split(",")
  ((row(0), row(1)), row(2))
})

// repartition and sort based on the key
// (CustomPartitioner is a user-defined partitioner; its definition is not shown)
val op = rows.repartitionAndSortWithinPartitions(new CustomPartitioner(4))
val temp = op.map(f => (f._1._1, (f._1._2, f._2)))

val mergeCombiners = (t1: (String, List[String]), t2: (String, List[String])) => 
    (t1._1 + t2._1, t1._2.++(t2._2))
val mergeValue = (x: (String, List[String]), y: (String, String)) => {
  val a = x._2.+:(y._2)
  (x._1, a)
}

// createCombiner, mergeValue, mergeCombiners
val x = temp.combineByKey(
  (t1: String, t2: String) => (t1, List(t2)),
  mergeValue,
  mergeCombiners)

temp.combineByKey gives a compile-time error, and I cannot figure out why.

1 Answer:

Answer 0 (score: 3)

If you want an output similar to what groupByKey gives you, then you should absolutely use groupByKey and not some other method. reduceByKey, combineByKey, and the like are only more efficient when compared to using groupByKey followed by an aggregation (which would give you the same result as one of the other groupBy methods).
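For example (a hypothetical variant, not what the question asks for): if only an aggregate per key were needed, such as a count of records, reduceByKey would be the efficient choice, because it merges partial results within each partition before the shuffle:

// Hypothetical: count records per key instead of collecting them.
// reduceByKey combines partial counts map-side, so only small
// aggregates are shuffled, not every value.
val countsPerKey = temp.mapValues(_ => 1).reduceByKey(_ + _)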

Since the wanted result is an RDD[(key, Iterable[value])], building the list yourself or letting groupByKey do it results in the same amount of work, so there is no need to reimplement groupByKey yourself; a minimal sketch follows below. The problem with groupByKey is not its implementation, but rather the distributed architecture.
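A minimal sketch of the recommended approach, reusing the temp RDD from the question (the type annotation is spelled out for clarity):

import org.apache.spark.rdd.RDD

// groupByKey directly returns RDD[(K, Iterable[V])], which is
// exactly the (key, iterable[value]) shape the question asks for
val grouped: RDD[(String, Iterable[(String, String)])] = temp.groupByKey()

grouped.collect().foreach { case (key, values) =>
  println(s"$key -> ${values.mkString(", ")}")
}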

For more information regarding groupByKey and these kinds of optimizations, I recommend reading more here.
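As for the compile error itself: combineByKey's first argument, createCombiner, must be a one-argument function of type V => C, but the question passes a two-argument function. Below is a sketch of a version that compiles, assuming the same temp: RDD[(String, (String, String))] as above; per the answer, though, plain groupByKey remains the better choice for this result shape:

// V = (date, value); C = (first date seen, list of values)
val createCombiner = (v: (String, String)) => (v._1, List(v._2))
// (C, V) => C: fold one more value into an existing combiner
val mergeValue = (c: (String, List[String]), v: (String, String)) =>
  (c._1, v._2 :: c._2)
// (C, C) => C: merge combiners built on different partitions
val mergeCombiners = (c1: (String, List[String]), c2: (String, List[String])) =>
  (c1._1, c1._2 ++ c2._2)

val combined = temp.combineByKey(createCombiner, mergeValue, mergeCombiners)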