Spark CombineByKey

Time: 2017-07-17 14:02:01

Tags: scala apache-spark

I have a Spark RDD in the following format.

Sample RDD:

Array[(String, (String, Double))] = Array(
       (2014-01-12 00:00:00.0,("XXX",829.95)), 
       (2013-08-28 00:00:00.0,("YYY",469.95000000000005)), 
       (2013-11-01 00:00:00.0,("ZZZ",129.99)), 
       (2013-07-25 00:00:00.0,("XYZ",879.8599999999999)), 
       (2013-10-19 00:00:00.0,("POI",989.94))
)

I am trying to use combineByKey to sum the Double values for each key in the RDD, using the following command:

rdd.combineByKey(
  (x:String,y:Double) => (x,y),
  (acc:(String, Double), v:(String, Double)) => acc._2  + v._2, 
  (acc2:(Double), acc3:(Double)) => (acc2 + acc3)
)

But I get the following error:

 <console>:46: error: overloaded method value combineByKey with
 alternatives:   [C](createCombiner: ((String, Double)) => C,
 mergeValue: (C, (String, Double)) => C, mergeCombiners: (C, C) =>
 C)org.apache.spark.rdd.RDD[(String, C)] <and>   [C](createCombiner:
 ((String, Double)) => C, mergeValue: (C, (String, Double)) => C,
 mergeCombiners: (C, C) => C, numPartitions:
 Int)org.apache.spark.rdd.RDD[(String, C)] <and>   [C](createCombiner:
 ((String, Double)) => C, mergeValue: (C, (String, Double)) => C,
 mergeCombiners: (C, C) => C, partitioner:
 org.apache.spark.Partitioner, mapSideCombine: Boolean, serializer:
 org.apache.spark.serializer.Serializer)org.apache.spark.rdd.RDD[(String,
 C)]  cannot be applied to ((String, Double) => (String, Double),
 ((String, Double), (String, Double)) => Double, (Double, Double) =>
 Double)
               custMaxOrdr.combineByKey((x:String,y:Double) => (x,y) ,(acc:(String,Double),valu:(String,Double)) => acc._2+valu._2,
 (acc2:(Double),acc3:(Double)) => (acc2+acc3))

Any help is appreciated.

Thanks, Rammy

1 answer:

Answer 0 (score: 3)

The types of the functions you pass do not match the expected types. Let's look at the signature of combineByKey:

def combineByKey[C](
  createCombiner: V => C,
  mergeValue: (C, V) => C,
  mergeCombiners: (C, C) => C): RDD[(K, C)]

So, you need to provide:

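  • Type C: the type of the combined result; in your case I assume it is Double. The compiler can infer this type if it is not given explicitly.
  • createCombiner: V => C: in our case, a function of type ((String, Double)) => Double. You pass (x:String,y:Double) => (x,y), which has a different type: it is a two-argument function of type (String, Double) => (String, Double), not a function of a single tuple argument. From your description, I assume you just want this function to extract the Double from the tuple, so you need: (in: (String, Double)) => in._2
  • mergeValue: (C, V) => C: in our case, (Double, (String, Double)) => Double, which does not match the function you provided, whose type is ((String, Double), (String, Double)) => Double
  • mergeCombiners: (C, C) => C: in our case, (Double, Double) => Double; here your function matches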
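
To make the roles of the three functions concrete, here is a minimal plain-Scala sketch of how they fit together. It is only a model of the semantics, not Spark's implementation: the helper combineByKeyModel, the object name, and the sample data are all hypothetical, and each inner Seq stands in for one partition.

```scala
// Plain-Scala model of combineByKey semantics; this is NOT Spark's implementation.
object CombineByKeyModel {
  // Hypothetical helper: each inner Seq stands in for one partition of an RDD[(K, V)].
  def combineByKeyModel[K, V, C](partitions: Seq[Seq[(K, V)]])(
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] = {
    // Within a partition: the first value seen for a key goes through createCombiner,
    // every later value for that key goes through mergeValue.
    val perPartition: Seq[Map[K, C]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        acc.get(k) match {
          case Some(c) => acc.updated(k, mergeValue(c, v))
          case None    => acc.updated(k, createCombiner(v))
        }
      }
    }
    // Across partitions: combiners for the same key are merged with mergeCombiners.
    perPartition.foldLeft(Map.empty[K, C]) { (merged, m) =>
      m.foldLeft(merged) { case (acc, (k, c)) =>
        acc.get(k) match {
          case Some(existing) => acc.updated(k, mergeCombiners(existing, c))
          case None           => acc.updated(k, c)
        }
      }
    }
  }

  def main(args: Array[String]): Unit = {
    // Made-up data with K = String, V = (String, Double), C = Double.
    val partitions = Seq(
      Seq(("2014-01-12", ("XXX", 100.0)), ("2014-01-12", ("YYY", 50.0))),
      Seq(("2014-01-12", ("ZZZ", 25.0)), ("2013-08-28", ("POI", 10.0)))
    )
    val sums = combineByKeyModel[String, (String, Double), Double](partitions)(
      v => v._2,              // createCombiner: V => C
      (acc, v) => acc + v._2, // mergeValue: (C, V) => C
      (a, b) => a + b         // mergeCombiners: (C, C) => C
    )
    println(sums)
  }
}
```

With this made-up input, the key 2014-01-12 ends up with 175.0 and 2013-08-28 with 10.0. Note that the accumulator in mergeValue is a Double (type C), while the incoming value is still the full (String, Double) tuple (type V).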

Altogether, this sums the Double values for each key:

val result: RDD[(String, Double)] = rdd.combineByKey(
  (in: (String, Double)) => in._2,
  (acc: Double, valu: (String, Double)) => acc + valu._2,
  (acc2: Double, acc3: Double) => acc2 + acc3
)

The parameter types can be omitted from all of the functions:

val result2: RDD[(String, Double)] = rdd.combineByKey(
  _._2,
  (acc, valu) => acc + valu._2,
  (acc2, acc3) => acc2 + acc3
)