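So I've run into a speed problem: I have a dataset that needs to be aggregated multiple times.
Initially, my team set up three accumulators and ran a foreach loop over the data.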
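I'm trying to switch these accumulations over to an aggregation so that I get a speed boost and still have access to the accumulated values for debugging. I'm currently trying to work out a way to aggregate all three types at once, since running three separate aggregations is much slower. Does anyone have an idea how to do this? Perhaps aggregate everything together and then pattern match to split the result into two RDDs?
Thanks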
Answer 0 (score: 1)
As far as I understand, all you need is aggregate, with zeroValue, seqOp, and combOp corresponding to the operations performed by your accumulators.
val zeroValue: (A, B, C) = ??? // (accum1.zero, accum2.zero, accum3.zero)

def seqOp(r: (A, B, C), t: T): (A, B, C) = r match {
  case (a, b, c) => {
    // Apply operations equivalent to
    //   accum1.addAccumulator(a, t)
    //   accum2.addAccumulator(b, t)
    //   accum3.addAccumulator(c, t)
    // and return the first argument. This is valid when the updates
    // mutate a, b, c in place; otherwise return a new tuple of the results.
    r
  }
}

def combOp(r1: (A, B, C), r2: (A, B, C)): (A, B, C) = (r1, r2) match {
  case ((a1, b1, c1), (a2, b2, c2)) => {
    // Apply operations equivalent to
    //   acc1.addInPlace(a1, a2)
    //   acc2.addInPlace(b1, b2)
    //   acc3.addInPlace(c1, c2)
    // and return the first argument (same caveat as in seqOp).
    r1
  }
}

val rdd: RDD[T] = ???
val accums: (A, B, C) = rdd.aggregate(zeroValue)(seqOp, combOp)
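
To make the template above concrete, here is a minimal sketch with the abstract types filled in. Everything specific in it is an assumption for illustration: the three "accumulators" are taken to be a running sum, a count, and a maximum over Int elements, and the names AggregateSketch, Acc, data, and partitions are hypothetical. It uses plain Scala collections instead of a real RDD, so it imitates what aggregate does (fold each partition with seqOp, then merge the partial results with combOp) rather than calling Spark itself.

```scala
object AggregateSketch {
  // Assumed stand-ins for the three accumulators: (sum, count, max).
  type Acc = (Long, Long, Int)

  // Neutral starting value, playing the role of (accum1.zero, accum2.zero, accum3.zero).
  val zeroValue: Acc = (0L, 0L, Int.MinValue)

  // Folds one element into a partition-local result
  // (the counterpart of the three addAccumulator calls).
  def seqOp(acc: Acc, t: Int): Acc = acc match {
    case (sum, count, mx) => (sum + t, count + 1, math.max(mx, t))
  }

  // Merges two partition-local results
  // (the counterpart of the three addInPlace calls).
  def combOp(x: Acc, y: Acc): Acc = (x, y) match {
    case ((s1, c1, m1), (s2, c2, m2)) => (s1 + s2, c1 + c2, math.max(m1, m2))
  }

  // Hypothetical input, split into chunks that stand in for RDD partitions.
  val data: Seq[Int] = Seq(3, 1, 4, 1, 5, 9, 2, 6)
  val partitions: Seq[Seq[Int]] = data.grouped(3).toSeq

  // One pass over the data produces all three results at once.
  val result: Acc =
    partitions.map(_.foldLeft(zeroValue)(seqOp)).foldLeft(zeroValue)(combOp)

  def main(args: Array[String]): Unit = println(result)
}
```

With an actual RDD[Int], the same zeroValue, seqOp, and combOp would be passed as rdd.aggregate(zeroValue)(seqOp, combOp), and the resulting tuple can be destructured with a pattern match such as val (sum, count, max) = result.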