Aggregating sums over an RDD in Scala (Spark)

Asked: 2018-01-31 18:14:00

Tags: scala apache-spark rdd

If I have a variable such as books: RDD[(String, Integer, Integer)], how do I merge entries that share the same String key (which could represent a title) and then sum the two corresponding integers (which could represent pages and price)?

For example:

[("book1", 20, 10),
 ("book2", 5, 10),
 ("book1", 100, 100)]

becomes

[("book1", 120, 110),
 ("book2", 5, 10)]
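Before reaching for Spark, the transformation being asked for can be sketched on a plain Scala collection (the `books` value below just re-creates the question's sample data; no SparkContext is needed to see the semantics):

```scala
// Sample data from the question: (title, pages, price)
val books = Seq(("book1", 20, 10), ("book2", 5, 10), ("book1", 100, 100))

val merged = books
  .groupBy(_._1)                                       // group rows sharing the same title
  .map { case (title, rows) =>
    (title, rows.map(_._2).sum, rows.map(_._3).sum)    // sum pages and prices per title
  }
  .toList
  .sortBy(_._1)                                        // deterministic order for display

println(merged)  // List((book1,120,110), (book2,5,10))
```

The answers below perform the same per-key reduction, but distributed across partitions via Spark's RDD or DataFrame APIs.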

3 answers:

Answer 0 (score: 3):

With RDDs, you can use reduceByKey:

case class Book(name: String, i: Int, j: Int) {
  def +(b: Book) =
    if (name == b.name) Book(name, i + b.i, j + b.j)
    else throw new IllegalArgumentException(s"Cannot add books with different names: $name, ${b.name}")
}

val rdd = sc.parallelize(Seq(
   Book("book1", 20, 10), 
   Book("book2",5,10), 
   Book("book1",100,100)))

val aggRdd = rdd.map(book => (book.name, book))
   .reduceByKey(_+_) // reduce calling our defined `+` function
   .map(_._2)        // we don't need the tuple anymore, just get the Books

aggRdd.foreach(println)
// Book(book1,120,110)
// Book(book2,5,10)

Answer 1 (score: 2):

Try converting it first to a tuple keyed by the title, then perform a reduceByKey:

yourRDD.map(t => (t._1, (t._2, t._3)))
  .reduceByKey((acc, elem) => (acc._1 + elem._1, acc._2 + elem._2))

Output:

// (book1,(120,110))
// (book2,(5,10))
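To see exactly what that reduceByKey call does per key, here is the same combiner function applied on a local collection (the `rows` value is the question's sample data; the grouping stands in for Spark's shuffle):

```scala
// Sample data from the question: (title, pages, price)
val rows = Seq(("book1", 20, 10), ("book2", 5, 10), ("book1", 100, 100))

val reduced = rows
  .map(t => (t._1, (t._2, t._3)))   // key by title: (title, (pages, price))
  .groupBy(_._1)                    // locally plays the role of Spark's shuffle
  .map { case (title, kvs) =>
    // the same pairwise combiner reduceByKey applies within each key
    (title, kvs.map(_._2).reduce((acc, elem) => (acc._1 + elem._1, acc._2 + elem._2)))
  }

println(reduced)
```

Unlike a local groupBy, reduceByKey combines values on each partition before shuffling, so only partial sums cross the network.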

Answer 2 (score: 1):

Just use a Dataset:

val spark: SparkSession = SparkSession.builder.getOrCreate()

val rdd = spark.sparkContext.parallelize(Seq(
  ("book1", 20, 10), ("book2", 5, 10), ("book1", 100, 100)
))

spark.createDataFrame(rdd).groupBy("_1").sum().show()

// +-----+-------+-------+                                                         
// |   _1|sum(_2)|sum(_3)|
// +-----+-------+-------+
// |book1|    120|    110|
// |book2|      5|     10|
// +-----+-------+-------+