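If I have a variable such as books: RDD[(String, Integer, Integer)], how can I merge the entries that share the same String key (which could represent a title) and sum the two corresponding Integers (which could represent pages and price)?
For example: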
[("book1", 20, 10),
("book2", 5, 10),
("book1", 100, 100)]
becomes
[("book1", 120, 110),
("book2", 5, 10)]
Answer 0 (score: 3)
With an RDD, you can use reduceByKey.
case class Book(name: String, i: Int, j: Int) {
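  // `throw Exception` does not compile in Scala; an exception instance must be created with `new`
  def +(b: Book): Book = if (name == b.name) Book(name, i + b.i, j + b.j) else throw new IllegalArgumentException(s"Cannot add books with different names: $name, ${b.name}")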
}
val rdd = sc.parallelize(Seq(
Book("book1", 20, 10),
Book("book2",5,10),
Book("book1",100,100)))
val aggRdd = rdd.map(book => (book.name, book))
.reduceByKey(_+_) // reduce calling our defined `+` function
.map(_._2) // we don't need the tuple anymore, just get the Books
aggRdd.collect().foreach(println) // collect to the driver so the output is printed locally; order may vary
// Book(book1,120,110)
// Book(book2,5,10)
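Since reduceByKey only ever combines values that share a key, the name check inside `+` should never fail in the pipeline above. As a minimal sketch for checking the combining logic without a cluster, the same per-key merge can be run on plain Scala collections (Scala 2.13 or later). The field names `pages` and `price` are assumptions taken from the question's description, and `groupMapReduce` is only a local stand-in for `reduceByKey`, not a Spark API.

```scala
// Plain Scala 2.13+ collections; no Spark dependency is needed for this sketch.
final case class Book(name: String, pages: Int, price: Int)

val books = Seq(
  Book("book1", 20, 10),
  Book("book2", 5, 10),
  Book("book1", 100, 100))

// Group by title, keep each Book as the value, and reduce each group
// by summing pages and price (the same function one would pass to reduceByKey).
val merged: Map[String, Book] =
  books.groupMapReduce(_.name)(book => book)((a, b) =>
    Book(a.name, a.pages + b.pages, a.price + b.price))

// Sort by title so the printed order is deterministic.
val result = merged.values.toSeq.sortBy(_.name)
result.foreach(println)
// Book(book1,120,110)
// Book(book2,5,10)
```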
Answer 1 (score: 2)
Try first converting it to an RDD of (key, tuple) pairs, and then apply reduceByKey:
yourRDD.map(t => (t._1, (t._2, t._3)))
  .reduceByKey((acc, elem) => (acc._1 + elem._1, acc._2 + elem._2))
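For reference, here is a self-contained sketch of the key-tuple approach applied to the question's sample data. It assumes a local Spark session (`master("local[*]")`) and script-style execution, for example pasted into spark-shell; the names `books`, `merged`, and `result`, as well as the final map back to a flat triple, are illustrative choices rather than part of the answer.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session, created here only to make the sketch runnable.
val spark = SparkSession.builder
  .master("local[*]")
  .appName("merge-books")
  .getOrCreate()

val books = spark.sparkContext.parallelize(Seq(
  ("book1", 20, 10), ("book2", 5, 10), ("book1", 100, 100)))

val merged = books
  // Re-shape each triple into (key, (pages, price)) so reduceByKey can be used.
  .map { case (title, pages, price) => (title, (pages, price)) }
  // Sum both integers component-wise for entries sharing a title.
  .reduceByKey { case ((p1, c1), (p2, c2)) => (p1 + p2, c1 + c2) }
  // Flatten back to the original (title, pages, price) shape.
  .map { case (title, (pages, price)) => (title, pages, price) }

// collect() brings the results to the driver; sorting gives a stable order.
val result = merged.collect().sortBy(_._1)
result.foreach(println)
// (book1,120,110)
// (book2,5,10)
```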
Answer 2 (score: 1)
Just use a Dataset:
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder.getOrCreate()
val rdd = spark.sparkContext.parallelize(Seq(
("book1", 20, 10), ("book2", 5, 10), ("book1", 100, 100)
))
spark.createDataFrame(rdd).groupBy("_1").sum().show()
// +-----+-------+-------+
// |   _1|sum(_2)|sum(_3)|
// +-----+-------+-------+
// |book1|    120|    110|
// |book2|      5|     10|
// +-----+-------+-------+
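The default column names (`_1`, `sum(_2)`) can be hard to read. A possible variant, sketched here under the assumption of a local session and script-style execution, names the columns up front and aliases the aggregates. The column names `title`, `pages`, and `price` are illustrative and come from the question's description, not from the answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

// Assumption: a local session, created here only to make the sketch runnable.
val spark = SparkSession.builder
  .master("local[*]")
  .appName("books-df")
  .getOrCreate()
import spark.implicits._ // enables .toDF on local Seqs

val df = Seq(("book1", 20, 10), ("book2", 5, 10), ("book1", 100, 100))
  .toDF("title", "pages", "price")

val totals = df
  .groupBy("title")
  // Alias the aggregates so the result keeps readable column names.
  .agg(sum("pages").as("pages"), sum("price").as("price"))
  .orderBy("title")

totals.show()

// sum over integer columns produces a long (bigint) column, hence getLong.
val rows = totals.collect().map(r => (r.getString(0), r.getLong(1), r.getLong(2)))
```

Here `rows` holds the totals 120 and 110 for book1 and 5 and 10 for book2, matching the table in the answer above.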