Question

Suppose you have a table named tracker with the following records.

issue_id  |  ingest_date         |  verb,status
10         2015-01-24 00:00:00    1,1
10         2015-01-25 00:00:00    2,2
10         2015-01-26 00:00:00    2,3
10         2015-01-27 00:00:00    3,4
11         2015-01-10 00:00:00    1,3
11         2015-01-11 00:00:00    2,4

I need the following result

10         2015-01-26 00:00:00    2,3
11         2015-01-11 00:00:00    2,4

I am trying this query

select * 
from etl_change_fact 
where ingest_date = (select max(ingest_date) 
                     from etl_change_fact);

However, this only gives me

10    2015-01-26 00:00:00    2,3

this one record.

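But I want one record per unique id (change_id) with

(a) max(ingest_date) AND

(b) verb column priority (2 - first preferred, 1 - next preferred, 3 - last preferred)

Hence, I need the following result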

10    2015-01-26 00:00:00    2,3
11    2015-01-11 00:00:00    2,4

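Please help me write an efficient query for this.

P.S.: I will not be indexing ingest_date, because I am going to set it as the "distribution key" in a Distributed Computing setup. I am new to data warehousing and querying.

Hence, please help me with an optimized way to query my TB-sized database.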

Answer 1

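This is a typical "greatest-n-per-group" problem. If you search for this tag here, you will find plenty of solutions, including ones for MySQL.

For Postgres, the quickest way is to use distinct on (which is a Postgres proprietary extension to the SQL language)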

select distinct on (issue_id) issue_id, ingest_date, verb, status
from etl_change_fact
order by issue_id, 
         case verb 
            when 2 then 1 
            when 1 then 2
            else 3
         end, ingest_date desc;
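
If distinct on is not available, the same "first row per group under a given ordering" idea can probably be expressed with the row_number() window function. The snippet below is only a sketch: it uses an in-memory SQLite database through Python as a stand-in for Postgres (that substitution is an assumption, chosen so the example is self-contained), loads the sample rows from the question with verb and status as separate integer columns, and needs SQLite 3.25 or later for window functions.

```python
import sqlite3

# In-memory SQLite stands in for Postgres here (an assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE etl_change_fact ("
    "issue_id INTEGER, ingest_date TEXT, verb INTEGER, status INTEGER)"
)
# Sample rows from the question.
conn.executemany(
    "INSERT INTO etl_change_fact VALUES (?, ?, ?, ?)",
    [
        (10, "2015-01-24 00:00:00", 1, 1),
        (10, "2015-01-25 00:00:00", 2, 2),
        (10, "2015-01-26 00:00:00", 2, 3),
        (10, "2015-01-27 00:00:00", 3, 4),
        (11, "2015-01-10 00:00:00", 1, 3),
        (11, "2015-01-11 00:00:00", 2, 4),
    ],
)

# Rank the rows of each issue_id by verb priority (2, then 1, then the rest)
# and, within the same priority, by newest ingest_date; keep rank 1 only.
query = """
SELECT issue_id, ingest_date, verb, status
FROM (
    SELECT f.*,
           ROW_NUMBER() OVER (
               PARTITION BY issue_id
               ORDER BY CASE verb WHEN 2 THEN 1 WHEN 1 THEN 2 ELSE 3 END,
                        ingest_date DESC
           ) AS rn
    FROM etl_change_fact f
) ranked
WHERE rn = 1
ORDER BY issue_id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
# (10, '2015-01-26 00:00:00', 2, 3)
# (11, '2015-01-11 00:00:00', 2, 4)
```

The inner SQL statement should also run on Postgres versions that support window functions; whether it performs as well as distinct on for a TB-sized table would have to be measured.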

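You can also extend your original query with a correlated sub-query so that the maximum is taken per issue_id (note that this variant only picks the latest row per issue and does not apply the verb priority):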

select f1.* 
from etl_change_fact f1
where f1.ingest_date = (select max(f2.ingest_date) 
                        from etl_change_fact f2
                        where f1.issue_id = f2.issue_id);
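
A quick check of this correlated query against the sample rows from the question, again only as a sketch with in-memory SQLite standing in for Postgres (an assumption) and verb and status stored as separate integer columns. It shows that filtering on max(ingest_date) per issue returns the 2015-01-27 row for issue 10, since the verb column plays no part in the filter.

```python
import sqlite3

# In-memory SQLite stands in for Postgres here (an assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE etl_change_fact ("
    "issue_id INTEGER, ingest_date TEXT, verb INTEGER, status INTEGER)"
)
# Sample rows from the question.
conn.executemany(
    "INSERT INTO etl_change_fact VALUES (?, ?, ?, ?)",
    [
        (10, "2015-01-24 00:00:00", 1, 1),
        (10, "2015-01-25 00:00:00", 2, 2),
        (10, "2015-01-26 00:00:00", 2, 3),
        (10, "2015-01-27 00:00:00", 3, 4),
        (11, "2015-01-10 00:00:00", 1, 3),
        (11, "2015-01-11 00:00:00", 2, 4),
    ],
)

# Latest row per issue_id, chosen by ingest_date alone.
latest_only = """
SELECT f1.issue_id, f1.ingest_date, f1.verb, f1.status
FROM etl_change_fact f1
WHERE f1.ingest_date = (SELECT MAX(f2.ingest_date)
                        FROM etl_change_fact f2
                        WHERE f1.issue_id = f2.issue_id)
ORDER BY f1.issue_id
"""
latest_rows = conn.execute(latest_only).fetchall()
for row in latest_rows:
    print(row)
# (10, '2015-01-27 00:00:00', 3, 4)
# (11, '2015-01-11 00:00:00', 2, 4)
```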

Edit

For an outdated and unsupported Postgres version, you can probably get away with something like this:

select f1.* 
from etl_change_fact f1
where f1.ingest_date = (select f2.ingest_date 
                        from etl_change_fact f2
                        where f1.issue_id = f2.issue_id
                        order by case f2.verb 
                                    when 2 then 1 
                                    when 1 then 2
                                    else 3
                                 end, f2.ingest_date desc
                        limit 1);

SQLFiddle example: http://sqlfiddle.com/#!15/3bb05/1
