Aggregating multiple values at once

Date: 2016-02-24 00:55:00

Tags: scala apache-spark

So I've run into a speed problem: I have a dataset that needs to be aggregated multiple times.

Initially my team set up three accumulators and ran a foreach loop over the data.
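
Schematically, that setup looks something like this (a simplified sketch using the Spark 1.x accumulator API, with stand-in quantities, a sum, a count, and a sum of squares, rather than our real accumulators):

import org.apache.spark.SparkContext

val sc: SparkContext = ???
val data = sc.parallelize(Seq(1.0, 2.0, 3.0))

val sumAcc   = sc.accumulator(0.0)  // accumulator 1: running sum
val countAcc = sc.accumulator(0L)   // accumulator 2: element count
val sumSqAcc = sc.accumulator(0.0)  // accumulator 3: sum of squares

// One pass over the data, updating all three accumulators as a side effect.
data.foreach { x =>
  sumAcc   += x
  countAcc += 1L
  sumSqAcc += x * x
}

println((sumAcc.value, countAcc.value, sumSqAcc.value))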


I'm trying to switch these accumulations over to an aggregate so that I get a speed boost and still have access to the accumulated values for debugging. I'm currently trying to figure out a way to aggregate all three types at once, since running 3 separate aggregations is much slower. Does anyone have an idea how to do this? Maybe aggregate them together and then pattern match to split them into two RDDs?

Thanks

1 Answer:

Answer 0 (score: 1)

As far as I can tell, all you need here is aggregate, with zeroValue, seqOp, and combOp corresponding to the operations performed by your accumulators.

val zeroValue: (A, B, C) = ??? // (accum1.zero, accum2.zero, accum3.zero)

def seqOp(r: (A, B, C), t: T): (A, B, C) = r match {
  case (a, b, c) =>
    // Apply operations equivalent to
    // accum1.addAccumulator(a, t)
    // accum2.addAccumulator(b, t)
    // accum3.addAccumulator(c, t)
    // and return the first argument
    r
}

def combOp(r1: (A, B, C), r2: (A, B, C)): (A, B, C) = (r1, r2) match {
  case ((a1, b1, c1), (a2, b2, c2)) =>
    // Apply operations equivalent to
    // acc1.addInPlace(a1, a2)
    // acc2.addInPlace(b1, b2)
    // acc3.addInPlace(c1, c2)
    // and return the first argument
    r1
}

val rdd: RDD[T] = ???

val accums: (A, B, C) = rdd.aggregate(zeroValue)(seqOp, combOp)
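
To make the sketch concrete, here is one possible runnable instantiation of the pattern, again with stand-in quantities (sum, count, and sum of squares) for the three aggregated values:

import org.apache.spark.rdd.RDD

// Computes all three values in a single pass, with no accumulators needed.
def aggregateOnce(rdd: RDD[Double]): (Double, Long, Double) = {
  val zeroValue = (0.0, 0L, 0.0)  // (sum, count, sum of squares)
  rdd.aggregate(zeroValue)(
    // seqOp: fold one element into a partition's partial result
    (acc, x) => (acc._1 + x, acc._2 + 1L, acc._3 + x * x),
    // combOp: merge the partial results of two partitions
    (l, r) => (l._1 + r._1, l._2 + r._2, l._3 + r._3)
  )
}

// Splitting the result back out is just tuple destructuring:
// val (sum, count, sumSq) = aggregateOnce(rdd)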