How do I get the count and the most recent record for each pair of columns?

Time: 2019-04-06 19:32:55

Tags: sql amazon-athena presto

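I have an Athena table with 4 columns (A, B, C, D), and I want to find:

  1. The number of rows associated with each unique combination of A and B
  2. The value of C from the most recent row for that same (A, B) pair, where D is a timestamp

For example, if this is the input data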

+---+---+-----+------------+
| A | B |  C  |     D      |
+---+---+-----+------------+
| 1 | 1 | 'a' | 2019-04-04 |
| 1 | 1 | 'b' | 2019-04-03 |
| 1 | 2 | 'c' | 2019-04-02 |
| 1 | 3 | 'd' | 2019-04-01 |
| 2 | 2 | 'e' | 2019-04-03 |
| 2 | 2 | 'f' | 2019-04-04 |
+---+---+-----+------------+

This is the desired output

+---+---+----------+-------+
| A | B | newest_C | count |
+---+---+----------+-------+
| 1 | 1 | 'a'      |     2 |
| 1 | 2 | 'c'      |     1 |
| 1 | 3 | 'd'      |     1 |
| 2 | 2 | 'f'      |     2 |
+---+---+----------+-------+

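I'm not very good with queries, and my best attempt was:

Join two subqueries, one that does the counting and another that ranks each row by time. Then, after the join, keep only the top-ranked row.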

WITH t1 AS (
    -- Row count for each (A, B) pair
    SELECT A, B, count(*) AS count
    FROM data
    GROUP BY A, B
),
t2 AS (
    -- Rank the rows of each (A, B) pair from newest to oldest
    SELECT A, B, C, RANK() OVER (PARTITION BY A, B ORDER BY D DESC) AS rank
    FROM data
)
SELECT t1.A, t1.B, t2.C AS newest_C, t1.count
FROM t1 LEFT JOIN t2 ON t1.A = t2.A AND t1.B = t2.B
WHERE t2.rank = 1

3 Answers:

Answer 0: (score: 1)

Presto has some sophisticated aggregation functions. So:

select a, b, count(*) as cnt,
       max_by(c, d) as newest_c
from t
group by a, b;

max_by() is explained in the documentation.
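As a minimal sketch, the max_by() approach can be tried without a real table by inlining the sample rows from the question with a VALUES clause. The alias `data`, the output names `newest_c` and `cnt`, and the use of DATE literals for `d` are assumptions made for illustration; the original table may store `d` as a timestamp.

```sql
-- Sample rows from the question, inlined so the query is self-contained.
SELECT a,
       b,
       max_by(c, d) AS newest_c,  -- value of c on the row with the largest d
       count(*)     AS cnt        -- number of rows in the (a, b) group
FROM (
    VALUES
        (1, 1, 'a', DATE '2019-04-04'),
        (1, 1, 'b', DATE '2019-04-03'),
        (1, 2, 'c', DATE '2019-04-02'),
        (1, 3, 'd', DATE '2019-04-01'),
        (2, 2, 'e', DATE '2019-04-03'),
        (2, 2, 'f', DATE '2019-04-04')
) AS data (a, b, c, d)
GROUP BY a, b
ORDER BY a, b;
```

On this data the query should produce the four rows shown in the desired output of the question.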

Answer 1: (score: 0)

This can be achieved using Presto window functions:

SELECT a, b, c AS newest_c, cnt
FROM (
    SELECT 
        t.*,
        COUNT(*)     OVER(PARTITION BY a, b) AS cnt,
        ROW_NUMBER() OVER(PARTITION BY a, b ORDER BY d DESC) AS rn
    FROM mytable t
) x WHERE rn = 1

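In the subquery, window functions count the records that share the same (a, b) tuple and number the records within each group by descending d. The outer query then keeps only the newest record in each group.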
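To make the intermediate step easier to see, the subquery can be run on its own. This is only a sketch: the sample rows from the question are inlined with VALUES, and typing `d` as DATE is an assumption.

```sql
-- Inner query only: every input row is kept, annotated with its group's
-- row count (cnt) and its position from newest to oldest (rn).
SELECT t.*,
       COUNT(*)     OVER (PARTITION BY a, b) AS cnt,
       ROW_NUMBER() OVER (PARTITION BY a, b ORDER BY d DESC) AS rn
FROM (
    VALUES
        (1, 1, 'a', DATE '2019-04-04'),
        (1, 1, 'b', DATE '2019-04-03'),
        (1, 2, 'c', DATE '2019-04-02'),
        (1, 3, 'd', DATE '2019-04-01'),
        (2, 2, 'e', DATE '2019-04-03'),
        (2, 2, 'f', DATE '2019-04-04')
) AS t (a, b, c, d)
ORDER BY a, b, rn;
-- For the group (1, 1): 'a' gets cnt = 2, rn = 1 and 'b' gets cnt = 2, rn = 2.
-- Filtering on rn = 1 therefore keeps 'a' and drops 'b'.
```

Unlike RANK(), ROW_NUMBER() never assigns the same number twice within a partition, so filtering on rn = 1 returns exactly one row per (a, b) group even when two rows share the same d.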

Answer 2: (score: 0)

Gordon Linoff's solution is fine. If you don't want to use max_by:

SELECT t1.a, t1.b, t1.c, t2.count
FROM data AS t1 
INNER JOIN
  (SELECT a, b, count(*) AS count, max(d) AS d 
  FROM data 
  GROUP BY a,b) AS t2
ON t1.a = t2.a AND t1.b = t2.b AND t1.d = t2.d

Here is a demo!
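One caveat with the join on max(d): if two rows in the same (a, b) group share the latest d, both match the join and the group appears twice. The following is a hedged sketch of one way to collapse such ties. The two-row data set is hypothetical, and picking max(c) as the tie-breaker is an arbitrary assumption, not part of the original answer.

```sql
-- Hypothetical tie: both rows for (1, 1) carry the same latest date.
WITH data AS (
    SELECT *
    FROM (
        VALUES
            (1, 1, 'a', DATE '2019-04-04'),
            (1, 1, 'b', DATE '2019-04-04')
    ) AS v (a, b, c, d)
)
SELECT t1.a, t1.b, max(t1.c) AS newest_c, t2.cnt
FROM data AS t1
INNER JOIN (
    SELECT a, b, count(*) AS cnt, max(d) AS d
    FROM data
    GROUP BY a, b
) AS t2
ON t1.a = t2.a AND t1.b = t2.b AND t1.d = t2.d
-- Grouping collapses the tied rows into one; max(c) decides which c survives.
GROUP BY t1.a, t1.b, t2.cnt;
```

Without the outer GROUP BY, this input would yield two rows for the pair (1, 1), each with a count of 2.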