I need to compute a cumulative sum for each day.
For example, my dataset looks like this:
buyer | bread | date |
---------------------------
b1 | 2 | 2018-01-01|
b1 | 3 | 2018-01-02|
b1 | 1 | 2018-01-04|
b2 | 2 | 2018-01-02|
I need a select that returns the following:
buyer | cum_sum_on_01_01 | cum_sum_on_01_02 | cum_sum_on_01_03 | cum_sum_on_01_04 | cum_sum_on_01_05 |...
----------------------------------------------------------------------------------------------------------
b1 | 2 | 5 | 5 | 6 | 6 |...
b2 | 0 | 2 | 2 | 2 | 2 |...
How can I do this?
Answer 0 (score: 1)
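What is the point of doing this "without built-in function"? At present, the only way to get a cumulative sum in ClickHouse is arrayCumSum. So the answer is to build the candidate array and pass it to arrayCumSum. The steps are as follows: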
SELECT
    buyer,
    groupArray(bread) AS breads
FROM
(
    SELECT
        buyer,
        sum(bread) AS bread,
        date
    FROM bbd
    ALL RIGHT JOIN
    (
        WITH
            toDate('2018-01-01') AS min_date,
            toDate('2018-01-31') AS max_date
        SELECT
            arrayJoin(buyers) AS buyer,
            arrayJoin(arrayMap(i -> (min_date + toIntervalDay(i)), range(toUInt64((max_date - min_date) + 1)))) AS date
        FROM
        (
            SELECT groupUniqArray(buyer) AS buyers
            FROM bbd
        )
    ) USING (buyer, date)
    GROUP BY
        buyer,
        date
    ORDER BY
        buyer ASC,
        date ASC
)
GROUP BY buyer
┌─buyer─┬─breads──────────────────────────────────────────────────────────┐
│ b1 │ [2,3,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0] │
│ b2 │ [0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0] │
└───────┴─────────────────────────────────────────────────────────────────┘
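The inner subquery is the part that fills in the missing days, so it may help to look at it on its own. The following is a minimal sketch, assuming a hard-coded buyer list and a three-day range in place of the values taken from bbd. Two arrayJoin calls over different arrays give the Cartesian product, that is, one row per (buyer, day):

```sql
WITH
    toDate('2018-01-01') AS min_date,
    toDate('2018-01-03') AS max_date
SELECT
    -- Hard-coded here for illustration; the query above reads the buyers from bbd.
    arrayJoin(['b1', 'b2']) AS buyer,
    -- range(n) yields [0, 1, ..., n - 1]; each offset is added to min_date as a day interval.
    arrayJoin(arrayMap(i -> (min_date + toIntervalDay(i)), range(toUInt64((max_date - min_date) + 1)))) AS date
ORDER BY
    buyer ASC,
    date ASC
```

With these inputs the sketch should return six rows: the three days for b1 followed by the three days for b2. The RIGHT JOIN against this grid is what gives the days without purchases a bread value of 0.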
In the main query, replace groupArray(bread) AS breads with arrayCumSum(groupArray(bread)) AS breads:
┌─buyer─┬─breads──────────────────────────────────────────────────────────┐
│ b1 │ [2,5,5,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6] │
│ b2 │ [0,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2] │
└───────┴─────────────────────────────────────────────────────────────────┘
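The question asks for one column per day rather than an array. A possible last step, sketched here on literal arrays instead of the bbd table, is to index into the cumulative array (ClickHouse arrays are 1-based). The column names come from the question; the inline subquery is only a stand-in for the query above:

```sql
SELECT
    buyer,
    breads[1] AS cum_sum_on_01_01,
    breads[2] AS cum_sum_on_01_02,
    breads[3] AS cum_sum_on_01_03,
    breads[4] AS cum_sum_on_01_04
FROM
(
    -- Stand-in for the arrayCumSum(groupArray(bread)) query, first four days only.
    SELECT 'b1' AS buyer, arrayCumSum([2, 3, 0, 1]) AS breads
    UNION ALL
    SELECT 'b2' AS buyer, arrayCumSum([0, 2, 0, 0]) AS breads
)
ORDER BY buyer ASC
```

For these literals the rows should be b1 with 2, 5, 5, 6 and b2 with 0, 2, 2, 2, matching the table in the question. The list of columns has to be written out (or generated by the client), since the number of columns in a query is fixed.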
Answer 1 (score: 0)
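The accepted answer is very good, and you really should use the built-in arrayCumSum function to compute a cumulative sum. However, if part of the motivation for the original question was to find a general way to build cumulative/fold-style functions that ClickHouse does not support natively (for example CumMax, CumMin, and so on), here is an approach that works with any aggregate function in ClickHouse.
The core of the approach is to use arrayReduceInRanges, with arrayMap and arrayEnumerate generating all ranges of the form (1, 1), (1, 2), ... (1, n). Each tuple is an (index, length) pair, so (1, k) covers the first k elements of the array. Whichever function you then pass to arrayReduceInRanges as the aggregate function, such as 'sum' or 'max', becomes the array-based cumulative form of that function. The logic looks like this: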
WITH arr AS (SELECT groupArray(some_col) AS arr_some_col FROM some_table)
SELECT
    arrayReduceInRanges(
        'sum',
        -- (index, length) ranges (1, 1), (1, 2), ..., (1, n)
        arrayMap(x -> (1, x), arrayEnumerate(arr_some_col)),
        arr_some_col
    )
FROM arr
From here, you can use arrayJoin to get the values back out of the array, or keep them in array form for further computation.
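To see the core logic in isolation, here is a small sketch on a literal array (the values are made up for illustration). The same prefix ranges are passed each time, and only the aggregate function name changes:

```sql
WITH [3, 1, 4, 1, 5] AS arr
SELECT
    -- Every range starts at index 1 and has length i, so it covers the first i elements.
    arrayReduceInRanges('sum', arrayMap(i -> (1, i), arrayEnumerate(arr)), arr) AS cum_sum,
    arrayReduceInRanges('max', arrayMap(i -> (1, i), arrayEnumerate(arr)), arr) AS cum_max,
    arrayReduceInRanges('min', arrayMap(i -> (1, i), arrayEnumerate(arr)), arr) AS cum_min
```

For this array the three columns should come out as [3,4,8,9,14], [3,3,4,4,5] and [3,1,1,1,1]. Note that each prefix is aggregated from scratch, so the work grows quadratically with the array length; for a plain sum, arrayCumSum remains the better choice.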
For your bread-specific application, the core logic above can be applied as follows (assuming your table is named bread_data):
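WITH ordered AS (SELECT * FROM bread_data ORDER BY date, buyer),
agg AS (
    SELECT
        buyer,
        -- untuple() expands each (date, cumulative value) tuple into two columns.
        untuple(
            arrayJoin(
                arrayZip(
                    groupArray(date),
                    arrayReduceInRanges(
                        -- 'sum' or any ClickHouse aggregate function.
                        'sum',
                        arrayMap(x -> (1, x), arrayEnumerate(groupArray(bread))),
                        groupArray(bread)
                    )
                )
            )
        )
    FROM ordered
    GROUP BY buyer
)
-- _ut_1 and _ut_2 are the column names generated by untuple();
-- the generated names may differ between ClickHouse versions.
SELECT buyer, _ut_1 AS date, _ut_2 AS cum_bread
FROM agg
ORDER BY date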
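Note the first WITH clause, which sorts the table by date and buyer. This ensures that the subsequent groupArray calls build their arrays in the same consistent order (the ClickHouse documentation states that otherwise any call to groupArray may build its elements in an arbitrary order).
This may look complicated, but once you break it down using the first core-logic snippet it should hopefully be fairly intuitive: much of the syntax here only groups values into arrays and ungroups them again, so that the main work can be done in array space.
The output looks like this: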
+-------+------------+-----------+
| buyer | date | cum_bread |
+-------+------------+-----------+
| b1 | 2018-01-01 | 2 |
| b2 | 2018-01-02 | 2 |
| b1 | 2018-01-02 | 5 |
| b1 | 2018-01-04 | 6 |
+-------+------------+-----------+
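On the ordering caveat: one alternative, which is not part of the answer above, is to collect (date, bread) tuples and sort them inside the array with arraySort, so that the result does not depend on the order in which groupArray receives its rows. A minimal sketch, with the sample rows inlined in deliberately shuffled order instead of reading from bread_data:

```sql
SELECT
    buyer,
    arrayMap(p -> p.1, pairs) AS dates,
    -- arrayCumSum is used here; arrayReduceInRanges with prefix ranges would work the same way.
    arrayCumSum(arrayMap(p -> p.2, pairs)) AS cum_bread
FROM
(
    SELECT
        buyer,
        -- Tuples compare element by element, so this sorts by date first.
        arraySort(groupArray((date, bread))) AS pairs
    FROM
    (
        -- Inlined sample rows (an assumption for illustration).
        SELECT 'b1' AS buyer, 1 AS bread, toDate('2018-01-04') AS date
        UNION ALL SELECT 'b1', 2, toDate('2018-01-01')
        UNION ALL SELECT 'b1', 3, toDate('2018-01-02')
        UNION ALL SELECT 'b2', 2, toDate('2018-01-02')
    )
    GROUP BY buyer
)
ORDER BY buyer ASC
```

For the sample rows this should give cum_bread = [2,5,6] for b1 and [2] for b2, in line with the output table above.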