I am working with a table like this:
ID f1
001 1
001 2
001 3
002 0
002 7
I want to compute the sum of the f1 column for rows with the same ID and add that sum as a new column, i.e.:
ID f1 sum_f1
001 1 6
001 2 6
001 3 6
002 0 7
002 7 7
My solution is to use reduceByKey to compute the sums and then join the result back to the original table:
val table = sc.parallelize(Seq(("001",1),("001",2),("001",3),("002",0),("002",7)))
val sum = table.reduceByKey(_ + _)
val result = table.leftOuterJoin(sum).map{ case (a,(b,c)) => (a, b, c.getOrElse(-1) )}
I get the correct result:
result.collect.foreach(println)
Output:
(002,0,7)
(002,7,7)
(001,1,6)
(001,2,6)
(001,3,6)
The problem is that this code has two shuffle stages: one in reduceByKey and another in leftOuterJoin. If I wrote this in Hadoop MapReduce, I could easily get the same result with only one shuffle stage (by calling output.collect more than once in the reduce phase).
So I am wondering whether there is a better way to do this with a single shuffle. Any suggestion would be appreciated.
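For illustration only, here is a rough Scala sketch of that single-shuffle MapReduce idea, written against the newer Reducer API (context.write) rather than the old OutputCollector.collect mentioned above; the class name and the tab-separated output format are hypothetical:
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.Reducer
import scala.collection.JavaConverters._

// Hypothetical reducer: after the single shuffle, all f1 values of one ID arrive
// together, so we can buffer them, compute the sum, and emit one output row per value.
class SumAndEmitReducer extends Reducer[Text, IntWritable, Text, Text] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, Text]#Context): Unit = {
    val f1s = values.asScala.map(_.get).toList
    val sum = f1s.sum
    f1s.foreach(f1 => context.write(key, new Text(s"$f1\t$sum")))
  }
}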
Answer 0 (score: 1)
Another way of doing this is to use aggregateByKey. It may be a harder method to grasp, but from the Spark docs:
(groupByKey) Note: This operation may be very expensive. If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using PairRDDFunctions.aggregateByKey or PairRDDFunctions.reduceByKey will provide much better performance.
Also, aggregateByKey is a generic function, so it is worth knowing anyway.
Of course, we are not doing a "simple aggregation such as sum" here, so the performance advantage of this approach over groupByKey may not be there.
Obviously, benchmarking both approaches on your real data is a good idea.
Here is the detailed implementation:
import org.apache.spark.rdd.RDD

// The input as given by OP here: http://stackoverflow.com/questions/36455419/spark-reducebykey-and-keep-other-columns
val table = sc.parallelize(Seq(("001", 1), ("001", 2), ("001", 3), ("002", 0), ("002", 7)))
// zero is initial value into which we will aggregate things.
// The second element is the sum.
// The first element is the list of values which contributed to this sum.
val zero = (List.empty[Int], 0)
// sequencer will receive an accumulator and the value.
// The accumulator will be reset for each key to 'zero'.
// In this sequencer we add value to the sum and append to the list because
// we want to keep both.
// This can be thought of as "map" stage in classic map/reduce.
def sequencer(acc: (List[Int], Int), value: Int) = {
  val (values, sum) = acc
  (value :: values, sum + value)
}
// combiner combines two lists and sums into one.
// The reason for this is the sequencer may run in different partitions
// and thus produce partial results. This step combines those partials into
// one final result.
// This step can be thought of as "reduce" stage in classic map/reduce.
def combiner(left: (List[Int], Int), right: (List[Int], Int)) = {
  (left._1 ++ right._1, left._2 + right._2)
}
// wiring it all together.
// Note the type of result it produces:
// Each key will have a list of values which contributed to the sum, and the sum itself.
val result: RDD[(String, (List[Int], Int))] = table.aggregateByKey(zero)(sequencer, combiner)
// To turn this to a flat list and print, use flatMap to produce:
// (key, value, sum)
val flatResult: RDD[(String, Int, Int)] = result.flatMap(result => {
  val (key, (values, sum)) = result
  for (value <- values) yield (key, value, sum)
})
// collect and print
flatResult.collect().foreach(println)
This produces:
(001,1,6)
(001,2,6)
(001,3,6)
(002,0,7)
(002,7,7)
There is also a gist with a fully runnable version of the above if you want to refer to it: https://gist.github.com/ppanyukov/253d251a16fbb660f225fb425d32206a
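As an aside (not part of the original answer): since aggregateByKey is generic, the same pattern covers other per-key aggregations. A minimal sketch computing a per-key average over the same table RDD, using a (count, sum) accumulator instead of a list of values:
// Hypothetical aside: per-key average with aggregateByKey, reusing `table` from above.
val zeroCS = (0L, 0L)
val countAndSum = table.aggregateByKey(zeroCS)(
  (acc, v) => (acc._1 + 1, acc._2 + v),   // fold one value into (count, sum)
  (l, r) => (l._1 + r._1, l._2 + r._2))   // merge partial (count, sum) pairs
val avgByKey = countAndSum.mapValues { case (count, sum) => sum.toDouble / count }
avgByKey.collect().foreach(println)       // e.g. (001,2.0) and (002,3.5)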
Answer 1 (score: 0)
You can use groupByKey to get the list of values, compute the sum, and then recreate the rows with flatMapValues:
// One shuffle (groupByKey); the per-key sum is computed from the grouped values.
val g = table.groupByKey().flatMapValues { f1s =>
  val sum = f1s.reduce(_ + _)
  f1s.map(_ -> sum)
}
However, the reduce here runs locally on the values of a single key, so this will fail if one key has too many values.
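For completeness, a small usage sketch (not in the original answer) that flattens g back into (ID, f1, sum_f1) triples in the shape the question asks for:
// g is an RDD[(String, (Int, Int))] of (ID, (f1, sum_f1)) pairs.
g.map { case (id, (f1, sumF1)) => (id, f1, sumF1) }.collect().foreach(println)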
Another approach is to keep the join, but partition first so that the join is cheap:
val partitioned = table.partitionBy(
  new org.apache.spark.HashPartitioner(table.partitions.size))
partitioned.cache // May or may not improve performance.
val sum = partitioned.reduceByKey(_ + _)
val result = partitioned.join(sum)
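To see where the shuffles actually happen in either variant, one can inspect the RDD lineage (a small sketch, not from the original answer; it assumes the vals defined above):
// Prints the lineage; shuffle boundaries show up as ShuffledRDD stages in the output.
println(result.toDebugString)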
I cannot guess which one is faster. I would benchmark all the options.