Question

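I have a denormalized structure in Redshift. The plan is to keep creating records and, at retrieval time, consider only the latest attributes for each user.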

This is the table:

user_id   state  created_at
1         A      15-10-2015 02:00:00 AM
2         A      15-10-2015 02:00:01 AM
3         A      15-10-2015 02:00:02 AM
1         B      15-10-2015 02:00:03 AM
4         A      15-10-2015 02:00:04 AM
5         B      15-10-2015 02:00:05 AM
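
To reproduce the example, the table could be set up roughly as follows. This is only a sketch: the column types and constraints are assumptions, since the actual DDL is not shown, and the timestamps are written in ISO format rather than the display format above.

```sql
-- Hypothetical DDL: column types and NOT NULL constraints are assumptions.
CREATE TABLE customer_properties (
   user_id    integer     NOT NULL,
   state      varchar(10) NOT NULL,
   created_at timestamp   NOT NULL
);

-- Sample rows matching the table shown above.
INSERT INTO customer_properties (user_id, state, created_at) VALUES
   (1, 'A', '2015-10-15 02:00:00'),
   (2, 'A', '2015-10-15 02:00:01'),
   (3, 'A', '2015-10-15 02:00:02'),
   (1, 'B', '2015-10-15 02:00:03'),
   (4, 'A', '2015-10-15 02:00:04'),
   (5, 'B', '2015-10-15 02:00:05');
```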

The desired result set is:

user_id   state  created_at
2         A      15-10-2015 02:00:01 AM
3         A      15-10-2015 02:00:02 AM
4         A      15-10-2015 02:00:04 AM

I have a query that retrieves this result:

select user_id, first_value AS state
from (
   select user_id, first_value(state) OVER (
                     PARTITION BY user_id
                     ORDER BY created_at desc
                     ROWS between UNBOUNDED PRECEDING and CURRENT ROW)
   from customer_properties
   order by created_at) t
where first_value = 'A'

Is this the best way to retrieve it, or can the query be improved?

Answer 1

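The best query depends on various details: the selectivity of the query predicate, cardinalities, and data distribution. If state = 'A' is a selective condition (few rows qualify), this query should be substantially faster: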

SELECT c.user_id, c.state
FROM   customer_properties c
LEFT   JOIN customer_properties c1 ON c1.user_id = c.user_id
                                  AND c1.created_at > c.created_at
WHERE  c.state = 'A'
AND    c1.user_id IS NULL;

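That is provided there is an index on (state) (or even (state, user_id, created_at)) and another one on (user_id, created_at). Redshift itself has no conventional indexes, so there the comparable lever is the choice of sort key.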
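
On plain PostgreSQL, the two indexes mentioned could be created roughly like this. The index names are made up for illustration, and these statements are not expected to work on Redshift, which does not support CREATE INDEX.

```sql
-- PostgreSQL only; index names are hypothetical.
-- Supports the selective filter on state and covers the join columns.
CREATE INDEX customer_properties_state_idx
   ON customer_properties (state, user_id, created_at);

-- Supports the anti-join lookup for a later row of the same user.
CREATE INDEX customer_properties_user_created_idx
   ON customer_properties (user_id, created_at);
```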

There are various ways to make sure a later version of the row does not exist:

Select rows which are not present in other table
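
One such alternative is a NOT EXISTS anti-join. The sketch below should return the same rows as the LEFT JOIN / IS NULL form, but it has not been verified against Redshift's restrictions on correlated subqueries, so treat it as an assumption there.

```sql
-- Keep a row with state 'A' only if no later row exists for the same user.
SELECT c.user_id, c.state
FROM   customer_properties c
WHERE  c.state = 'A'
AND    NOT EXISTS (
   SELECT 1
   FROM   customer_properties c1
   WHERE  c1.user_id = c.user_id
   AND    c1.created_at > c.created_at
   );
```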

If 'A' is a common value in state, this more generic query will be faster:

SELECT user_id, state
FROM (
   SELECT user_id, state
        , row_number() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS rn
   FROM   customer_properties
   ) t
WHERE  t.rn = 1
AND    t.state = 'A';

I removed NULLS LAST, assuming created_at is defined NOT NULL. Also, I don't think Redshift has it:

PostgreSQL sort by datetime asc, null first?

Both queries should work with the limited functionality of Redshift. With modern Postgres, there are better options.
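
One option in modern Postgres is DISTINCT ON. This is an assumption about the kind of option meant here, not something the answer spells out. A minimal sketch, which does not run on Redshift because DISTINCT ON is a PostgreSQL extension:

```sql
-- PostgreSQL only: pick the latest row per user, then filter on its state.
SELECT user_id, state, created_at
FROM (
   SELECT DISTINCT ON (user_id)
          user_id, state, created_at
   FROM   customer_properties
   ORDER  BY user_id, created_at DESC   -- latest row per user comes first
   ) latest
WHERE  state = 'A';
```

The filter on state has to sit outside the subquery. Applied inside, it would pick the latest 'A' row per user rather than test whether the latest row is 'A'.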

Your original query would return all rows per user_id if the latest row matches. You would have to fold duplicates, which is needless work ...
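
To illustrate the extra folding step, this is roughly what the original window-function approach would need in order to return one row per user. The alias latest_state is introduced here only for readability. The row_number() variant avoids this step altogether.

```sql
-- DISTINCT folds the duplicates: the inner query emits one row per
-- input row, each carrying the user's latest state.
SELECT DISTINCT user_id, latest_state AS state
FROM (
   SELECT user_id
        , first_value(state) OVER (
             PARTITION BY user_id
             ORDER BY created_at DESC
             ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS latest_state
   FROM   customer_properties
   ) t
WHERE  latest_state = 'A';
```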

Retrieve records based on latest state / attribute value
