Question

I'm working on Redshift. I have a table like this:

userid  oid version number_of_objects
1       ab  1       10
1       ab  2       20
1       ab  3       17
1       ab  4       16
1       ab  5       14
1       cd  1       5
1       cd  2       6
1       cd  3       9
1       cd  4       12
2       ef  1       4
2       ef  2       3
2       gh  1       16
2       gh  2       12
2       gh  3       21

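I want to select the maximum version number for every oid from this table, along with the userid and number_of_objects of that row.

When I tried this, I unfortunately got the whole table back: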

SELECT MAX(version), oid, userid, number_of_objects
FROM table
GROUP BY oid, userid, number_of_objects
LIMIT 10;

But the result I'm actually looking for is:

userid  oid MAX(version)    number_of_objects
1       ab  5               14
1       cd  4               12
2       ef  2               3
2       gh  3               21

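Somehow DISTINCT ON does not work either; it says:

SELECT DISTINCT ON is not supported

Do you have any ideas?

UPDATE: In the meantime I came up with this workaround, but I feel it is not the smartest solution. It is also very slow. But at least it works. Just in case: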

SELECT * FROM table,
   (SELECT MAX(version) as maxversion, oid, userid
    FROM table
    GROUP BY oid, userid
    ) as maxtable
    WHERE  table.oid = maxtable.oid
   AND table.userid = maxtable.userid
   AND table.version = maxtable.maxversion
LIMIT 100;

Do you have a better solution?
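For anyone who wants to sanity-check the workaround without a cluster, here is a minimal sketch that runs the same join on the sample rows from the question. The assumptions are mine: SQLite (through Python's sqlite3 module) only stands in for Redshift, the table name `versions` is made up because `table` is a reserved word, and the join compares against the `maxversion` alias that the subquery defines.

```python
import sqlite3

# Sample rows from the question: (userid, oid, version, number_of_objects).
ROWS = [
    (1, "ab", 1, 10), (1, "ab", 2, 20), (1, "ab", 3, 17), (1, "ab", 4, 16), (1, "ab", 5, 14),
    (1, "cd", 1, 5), (1, "cd", 2, 6), (1, "cd", 3, 9), (1, "cd", 4, 12),
    (2, "ef", 1, 4), (2, "ef", 2, 3),
    (2, "gh", 1, 16), (2, "gh", 2, 12), (2, "gh", 3, 21),
]


def latest_by_join(rows):
    """Return the row with the highest version per (oid, userid) via a GROUP BY self-join."""
    conn = sqlite3.connect(":memory:")  # SQLite is a stand-in for Redshift (assumption)
    conn.execute(
        "CREATE TABLE versions "
        "(userid INTEGER, oid TEXT, version INTEGER, number_of_objects INTEGER)"
    )
    conn.executemany("INSERT INTO versions VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT v.userid, v.oid, v.version, v.number_of_objects
        FROM versions v
        JOIN (SELECT oid, userid, MAX(version) AS maxversion
              FROM versions
              GROUP BY oid, userid) m
          ON v.oid = m.oid
         AND v.userid = m.userid
         AND v.version = m.maxversion
        ORDER BY v.oid
        """
    ).fetchall()
    conn.close()
    return result


print(latest_by_join(ROWS))
```

On the sample data this gives back exactly the four rows listed as the desired result.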

Answer 1

If Redshift has window functions, you could try this:

SELECT * 
FROM (
  select oid, 
         userid, 
         version,
         max(version) over (partition by oid, userid) as max_version
  from the_table
) t
where version = max_version;

I would expect this to be faster than the self-join with GROUP BY.

Another option is to use the row_number() function:

SELECT * 
FROM (
  select oid, 
         userid, 
         version,
         row_number() over (partition by oid, userid order by version desc) as rn
  from the_table
) t
where rn = 1;

Which one to use is more a matter of personal taste. Performance-wise, I wouldn't expect any difference.
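To check that the two window-function variants agree, here is a rough sketch that runs both on the question's sample rows. Same caveats as any local test: SQLite (3.25 or newer, for window functions) is only a stand-in for Redshift, and the table name `versions` plus the extra `number_of_objects` column in the select list are my additions so the output matches what the question asks for.

```python
import sqlite3

# Sample rows from the question: (userid, oid, version, number_of_objects).
ROWS = [
    (1, "ab", 1, 10), (1, "ab", 2, 20), (1, "ab", 3, 17), (1, "ab", 4, 16), (1, "ab", 5, 14),
    (1, "cd", 1, 5), (1, "cd", 2, 6), (1, "cd", 3, 9), (1, "cd", 4, 12),
    (2, "ef", 1, 4), (2, "ef", 2, 3),
    (2, "gh", 1, 16), (2, "gh", 2, 12), (2, "gh", 3, 21),
]

# Variant 1: keep the rows whose version equals the per-group maximum.
MAX_OVER = """
SELECT userid, oid, version, number_of_objects
FROM (
  SELECT userid, oid, version, number_of_objects,
         MAX(version) OVER (PARTITION BY oid, userid) AS max_version
  FROM versions
) t
WHERE version = max_version
ORDER BY oid
"""

# Variant 2: number the rows per group, newest first, and keep the first one.
ROW_NUMBER = """
SELECT userid, oid, version, number_of_objects
FROM (
  SELECT userid, oid, version, number_of_objects,
         ROW_NUMBER() OVER (PARTITION BY oid, userid ORDER BY version DESC) AS rn
  FROM versions
) t
WHERE rn = 1
ORDER BY oid
"""


def run(query, rows=ROWS):
    """Load the rows into an in-memory SQLite table and run one query against it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE versions "
        "(userid INTEGER, oid TEXT, version INTEGER, number_of_objects INTEGER)"
    )
    conn.executemany("INSERT INTO versions VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(query).fetchall()
    conn.close()
    return result


print(run(MAX_OVER) == run(ROW_NUMBER))
```

One difference worth keeping in mind: if two rows in a group share the highest version, the max() variant returns both of them, while row_number() keeps only one. The sample data has no such ties, so both queries return the same four rows here.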

Answer 2

select      distinct
            first_value(userid) over(
                  partition by oid 
                  order by version desc
                  rows between unbounded preceding and unbounded following
                  ) as userid
            , oid
            , first_value(version) over(
                  partition by oid
                  order by version desc
                  rows between unbounded preceding and unbounded following
                  ) as max_version
            , first_value(number_of_objects) over(
                  partition by oid
                  order by version desc
                  rows between unbounded preceding and unbounded following
                  ) as number_of_objects

from        table
order by    oid;

AWS Redshift Documentation first_value

Don't forget NULLS LAST in the ORDER BY if version is nullable, since with ORDER BY version DESC the NULLs sort first by default.
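A minimal sketch of the first_value approach on the question's sample rows, in case someone wants to try it locally. Assumptions on my side: SQLite (3.25+) stands in for Redshift, the table is called `versions`, and the repeated window definition is built once in a Python string purely to keep the example short. The sample data has no NULL versions, so this does not exercise the NULLS LAST point, and SQLite orders NULLs differently from Redshift anyway.

```python
import sqlite3

# Sample rows from the question: (userid, oid, version, number_of_objects).
ROWS = [
    (1, "ab", 1, 10), (1, "ab", 2, 20), (1, "ab", 3, 17), (1, "ab", 4, 16), (1, "ab", 5, 14),
    (1, "cd", 1, 5), (1, "cd", 2, 6), (1, "cd", 3, 9), (1, "cd", 4, 12),
    (2, "ef", 1, 4), (2, "ef", 2, 3),
    (2, "gh", 1, 16), (2, "gh", 2, 12), (2, "gh", 3, 21),
]

# The same window is used for every column: all rows of one oid, newest first.
WINDOW = (
    "OVER (PARTITION BY oid ORDER BY version DESC "
    "ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)"
)

QUERY = f"""
SELECT DISTINCT
       first_value(userid) {WINDOW} AS userid,
       oid,
       first_value(version) {WINDOW} AS max_version,
       first_value(number_of_objects) {WINDOW} AS number_of_objects
FROM versions
ORDER BY oid
"""


def latest_by_first_value(rows):
    """Return one row per oid, taking each column from the highest-version row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE versions "
        "(userid INTEGER, oid TEXT, version INTEGER, number_of_objects INTEGER)"
    )
    conn.executemany("INSERT INTO versions VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


print(latest_by_first_value(ROWS))
```

Every row in a partition computes the same three first_value results, which is why the DISTINCT is needed to collapse them to one row per oid.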

Answer 3

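Long story short: ride the horse, i.e. use @a_horse_with_no_name's window function approach.

The asker's approach should be faster on smaller tables and when pulling sample data, but the window function approach will be more consistent in performance and will be faster over the whole table.

Below are some EXPLAIN results from a table of mine that has 17 columns, 184 121 798 rows and 12 809 740 unique IDs (14 versions per ID on average, but up to 40 versions).

Quick summary:

Tomi's approach: cost = 5983958.76..67801689853856.94 (6 * 10^6 for the first row, 7 * 10^13 for the whole table)

@a_horse_with_no_name's approach: cost = 1000027117538.39..1000031720583.59 (10^12 for any query)

@Merlin: almost exactly the same as the approach above.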

Original approach

explain
SELECT * FROM table t,
(SELECT MAX(version) as maxversion, id
 FROM table
 GROUP BY id
) as maxtable
WHERE  t.id = maxtable.id
       AND t.version = maxtable.maxversion;

XN Hash Join DS_DIST_NONE  (cost=5983958.76..67801689853856.94 rows=63811541 width=590)
  Hash Cond: ((("outer".id)::text = ("inner".id)::text) AND ("outer".version = "inner".maxversion))
  ->  XN Seq Scan on equipment_visits ev  (cost=0.00..1841218.08 rows=184121808 width=418)
  ->  XN Hash  (cost=5063349.72..5063349.72 rows=184121808 width=172)
        ->  XN Subquery Scan maxtable  (cost=2761827.12..5063349.72 rows=184121808 width=172)
              ->  XN HashAggregate  (cost=2761827.12..3222131.64 rows=184121808 width=44)
                    ->  XN Seq Scan on equipment_visits  (cost=0.00..1841218.08 rows=184121808 width=44)

So the cost is 5983958.76 (6 * 10^6) for the first row and 67801689853856.94 (7 * 10^13) for all rows.
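In case the rounding is not obvious, here is a small sketch of how those two figures come out of the top line of the plan. It is just string parsing in Python; the helper name `parse_cost` is made up for illustration, and it relies on the `cost=first..last` reading described above (first number for the first row, second number for all rows).

```python
import re

# Top line of the EXPLAIN output for the original approach, copied from above.
PLAN_LINE = (
    "XN Hash Join DS_DIST_NONE  "
    "(cost=5983958.76..67801689853856.94 rows=63811541 width=590)"
)


def parse_cost(plan_line):
    """Extract the (first-row cost, all-rows cost) pair from one plan line."""
    match = re.search(r"cost=(\d+\.\d+)\.\.(\d+\.\d+)", plan_line)
    if match is None:
        raise ValueError("no cost=... range found in plan line")
    return float(match.group(1)), float(match.group(2))


first_row, all_rows = parse_cost(PLAN_LINE)
# One significant digit in scientific notation gives the rounded magnitudes.
print(f"{first_row:.0e} {all_rows:.0e}")  # prints "6e+06 7e+13"
```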

a_horse_with_no_name's approach

Both solutions provided by @a_horse_with_no_name have almost exactly the same plan, so I will paste only one of them:

explain
SELECT * 
FROM (
  select *,
         row_number() over (partition by id order by version desc) as rn
  from table
)
where rn = 1;

gives

  Filter: (rn = 1)
  ->  XN Window  (cost=1000027117538.39..1000029419060.99 rows=184121808 width=44)
        Partition: id
        Order: version
        ->  XN Sort  (cost=1000027117538.39..1000027577842.91 rows=184121808 width=44)
              Sort Key: id, version
              ->  XN Seq Scan on table  (cost=0.00..1841218.08 rows=184121808 width=44)

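Merlin's approach

The solution provided by @Merlin seems incomplete, since it does not return all the values of the latest version's row, but its performance is similar to the second option: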

explain
select      distinct
              id
            , first_value(version) over(
                  partition by id
                  order by version desc
                  rows between unbounded preceding and unbounded following
                  ) as max_version
            , first_value(additional_col) over(
                  partition by id
                  order by version desc
                  rows between unbounded preceding and unbounded following
                  ) as additional_col

from        table t;

gives

XN Unique  (cost=1000027117538.39..1000032180888.11 rows=184121808 width=84)
  ->  XN Window  (cost=1000027117538.39..1000030799974.55 rows=184121808 width=84)
        Partition: id
        Order: version
        ->  XN Sort  (cost=1000027117538.39..1000027577842.91 rows=184121808 width=84)
              Sort Key: id, version
              ->  XN Seq Scan on table  (cost=0.00..1841218.08 rows=184121808 width=84)

PostgreSQL (Redshift): maximum value for a specific column
