How can I use reduceByKey instead of groupBy on data stored as an RDD?
The goal is to group by key and then sum the values.
I have a working Scala process that computes odds ratios.
Problem:
The volume of data our ingestion script processes has grown dramatically, and the job has started failing with memory/disk issues. The main cause is the heavy shuffling triggered by the GROUP BY.
Sample data:
(543040000711860,543040000839322,0,0,0,0)
(543040000711860,543040000938728,0,0,1,1)
(543040000711860,543040000984046,0,0,1,1)
(543040000711860,543040001071137,0,0,1,1)
(543040000711860,543040001121115,0,0,1,1)
(543040000711860,543040001281239,0,0,0,0)
(543040000711860,543040001332995,0,0,1,1)
(543040000711860,543040001333073,0,0,1,1)
(543040000839322,543040000938728,0,1,0,0)
(543040000839322,543040000984046,0,1,0,0)
(543040000839322,543040001071137,0,1,0,0)
(543040000839322,543040001121115,0,1,0,0)
(543040000839322,543040001281239,1,0,0,0)
(543040000839322,543040001332995,0,1,0,0)
(543040000839322,543040001333073,0,1,0,0)
(543040000938728,543040000984046,0,0,1,1)
(543040000938728,543040001071137,0,0,1,1)
(543040000938728,543040001121115,0,0,1,1)
(543040000938728,543040001281239,0,0,0,0)
(543040000938728,543040001332995,0,0,1,1)
Here is the code that transforms the data:
// Group on the (id1, id2) pair, then sum each of the four counter columns per group.
var groupby = flags.groupBy(item => (item._1, item._2))
var counted_group = groupby.map(item => (item._1, item._2.map(_._3).sum, item._2.map(_._4).sum, item._2.map(_._5).sum, item._2.map(_._6).sum))
Results:
((3900001339662,3900002247644),6,12,38,38)
((543040001332995,543040001352893),112,29,57,57)
((3900001572602,543040001071137),1,0,1,1)
((3900001640810,543040001281239),2,1,0,0)
((3900001295323,3900002247644),8,21,8,8)
I need to convert this to a REDUCE BY KEY style operation so that the data is reduced within each partition before being sent over the network. I am working with RDDs, so there is no direct way to do a REDUCE BY.
Answer 0 (score: 1)
I think I solved this using aggregateByKey.
Re-map the RDD to produce key-value pairs, then apply the aggregateByKey function to the result; each partition now returns an aggregated result instead of a grouped one.
val rddPair = flags.map(item => ((item._1, item._2), (item._3, item._4, item._5, item._6)))
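The original answer does not show the aggregateByKey call itself. A minimal sketch of what that step might look like for these four counter values (the zero value, the variable name aggregated, and the merge functions are assumptions, not from the post):
// Sketch only: the zero value is four zero counters; the first function merges a
// record into the partition-local accumulator, the second merges accumulators
// across partitions, so most of the summing happens before the shuffle.
val aggregated = rddPair.aggregateByKey((0, 0, 0, 0))(
  (acc, v) => (acc._1 + v._1, acc._2 + v._2, acc._3 + v._3, acc._4 + v._4),
  (a, b) => (a._1 + b._1, a._2 + b._2, a._3 + b._3, a._4 + b._4)
)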
Answer 1 (score: 0)
reduceByKey requires an RDD[(K, V)], i.e. key-value pairs, so you should first create a pair RDD:
val rddPair = flags.map(item => ((item._1, item._2), (item._3, item._4, item._5, item._6)))
Then you can apply reduceByKey on the rddPair above as follows:
rddPair.reduceByKey((x, y)=> (x._1+y._1, x._2+y._2, x._3+y._3, x._4+y._4))
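For completeness, here is a minimal, self-contained sketch of this answer's approach; the SparkSession setup, the object name, and the hard-coded sample rows are assumptions for illustration, and only the map and reduceByKey lines come from the post:
import org.apache.spark.sql.SparkSession

object ReduceByKeyExample {
  def main(args: Array[String]): Unit = {
    // Local session just for trying the snippet; in the real job, flags comes from ingestion.
    val spark = SparkSession.builder().appName("reduceByKey-example").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // A few rows shaped like the sample data: (id1, id2, c1, c2, c3, c4).
    // The first two rows deliberately repeat the same key so the reduce combines something.
    val flags = sc.parallelize(Seq(
      (543040000711860L, 543040000839322L, 0, 0, 0, 0),
      (543040000711860L, 543040000839322L, 0, 0, 1, 1),
      (543040000839322L, 543040000938728L, 0, 1, 0, 0)
    ))

    // Key on the id pair, keep the four counters as the value.
    val rddPair = flags.map(item => ((item._1, item._2), (item._3, item._4, item._5, item._6)))

    // Sum the counters per key; reduceByKey combines map-side before shuffling,
    // which is what avoids the memory pressure the groupBy version ran into.
    val summed = rddPair.reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2, x._3 + y._3, x._4 + y._4))

    summed.collect().foreach(println)
    spark.stop()
  }
}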
I hope this answer is helpful.