PostgreSQL (Redshift): maximum value for a specific column

Time: 2014-05-13 10:37:28

Tags: sql group-by max amazon-redshift

I'm working on Redshift, and I have a table like this:
userid  oid version number_of_objects
1       ab  1       10
1       ab  2       20
1       ab  3       17
1       ab  4       16
1       ab  5       14
1       cd  1       5
1       cd  2       6
1       cd  3       9
1       cd  4       12
2       ef  1       4
2       ef  2       3
2       gh  1       16
2       gh  2       12
2       gh  3       21

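From this table I would like to select the maximum version number for each oid, together with the userid and the number_of_objects of that row.

When I tried the following, I unfortunately got the whole table back: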

SELECT MAX(version), oid, userid, number_of_objects
FROM table
GROUP BY oid, userid, number_of_objects
LIMIT 10;

But the result I'm actually looking for is:

userid  oid MAX(version)    number_of_objects
1       ab  5               14
1       cd  4               12
2       ef  2               3
2       gh  3               21

DISTINCT ON somehow doesn't work either; it says:

SELECT DISTINCT ON is not supported

Do you have any ideas?


Update: In the meantime I came up with this workaround, but I feel it's not the smartest solution. It's also slow. But at least it works. Just in case:

SELECT * FROM table,
   (SELECT MAX(version) as maxversion, oid, userid
    FROM table
    GROUP BY oid, userid
    ) as maxtable
    WHERE  table.oid = maxtable.oid
   AND table.userid = maxtable.userid
   AND table.version = maxtable.maxversion
LIMIT 100;

Do you have a better solution?
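The workaround can be checked against the sample data from the question. What follows is only a minimal sketch under some assumptions: SQLite (through Python's `sqlite3` module) stands in for Redshift so the query can run locally, the table name `object_versions` is hypothetical, and the implicit comma join is written as an explicit `JOIN`. The important detail is that the outer `version` is compared with the subquery's alias `maxversion`.

```python
import sqlite3

# Sample rows from the question: (userid, oid, version, number_of_objects).
ROWS = [
    (1, "ab", 1, 10), (1, "ab", 2, 20), (1, "ab", 3, 17),
    (1, "ab", 4, 16), (1, "ab", 5, 14),
    (1, "cd", 1, 5), (1, "cd", 2, 6), (1, "cd", 3, 9), (1, "cd", 4, 12),
    (2, "ef", 1, 4), (2, "ef", 2, 3),
    (2, "gh", 1, 16), (2, "gh", 2, 12), (2, "gh", 3, 21),
]

# Join each row against the per-(oid, userid) maximum version.
# "object_versions" is a hypothetical table name used for illustration.
JOIN_QUERY = """
SELECT t.userid, t.oid, t.version, t.number_of_objects
FROM object_versions AS t
JOIN (SELECT oid, userid, MAX(version) AS maxversion
      FROM object_versions
      GROUP BY oid, userid) AS maxtable
  ON t.oid = maxtable.oid
 AND t.userid = maxtable.userid
 AND t.version = maxtable.maxversion
ORDER BY t.userid, t.oid
"""


def latest_by_join():
    """Load the sample rows into an in-memory SQLite table and run the join."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE object_versions "
        "(userid INTEGER, oid TEXT, version INTEGER, number_of_objects INTEGER)"
    )
    conn.executemany("INSERT INTO object_versions VALUES (?, ?, ?, ?)", ROWS)
    rows = conn.execute(JOIN_QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in latest_by_join():
        print(row)
```

On the sample data this yields one row per oid, matching the desired result listed in the question.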

3 answers:

Answer 0: (score: 7)

If Redshift has window functions, you could try this:

SELECT * 
FROM (
  select oid, 
         userid, 
         version,
         max(version) over (partition by oid, userid) as max_version
  from the_table
) t
where version = max_version;

I would expect that to be faster than the self-join with a group by.

Another option is to use the row_number() function:

SELECT * 
FROM (
  select oid, 
         userid, 
         version,
         row_number() over (partition by oid, userid order by version desc) as rn
  from the_table
) t
where rn = 1;

Which one to use is more a matter of personal taste. Performance-wise, I wouldn't expect any difference.
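Both window-function variants can be tried on the sample data from the question. This is a minimal sketch under assumptions, not a Redshift benchmark: SQLite 3.25 or later (through Python's `sqlite3` module) stands in for Redshift, the table name `object_versions` is hypothetical, and `number_of_objects` is added to the inner select so that the column the question asks for comes back too.

```python
import sqlite3

# Sample rows from the question: (userid, oid, version, number_of_objects).
ROWS = [
    (1, "ab", 1, 10), (1, "ab", 2, 20), (1, "ab", 3, 17),
    (1, "ab", 4, 16), (1, "ab", 5, 14),
    (1, "cd", 1, 5), (1, "cd", 2, 6), (1, "cd", 3, 9), (1, "cd", 4, 12),
    (2, "ef", 1, 4), (2, "ef", 2, 3),
    (2, "gh", 1, 16), (2, "gh", 2, 12), (2, "gh", 3, 21),
]

# Variant 1: attach the per-partition maximum to every row, then filter.
MAX_OVER_QUERY = """
SELECT userid, oid, version, number_of_objects
FROM (
  SELECT userid, oid, version, number_of_objects,
         max(version) OVER (PARTITION BY oid, userid) AS max_version
  FROM object_versions
) AS t
WHERE version = max_version
ORDER BY userid, oid
"""

# Variant 2: number the rows per partition, newest first, and keep row 1.
ROW_NUMBER_QUERY = """
SELECT userid, oid, version, number_of_objects
FROM (
  SELECT userid, oid, version, number_of_objects,
         row_number() OVER (PARTITION BY oid, userid
                            ORDER BY version DESC) AS rn
  FROM object_versions
) AS t
WHERE rn = 1
ORDER BY userid, oid
"""


def run(query):
    """Load the sample rows into an in-memory SQLite table and run one query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE object_versions "
        "(userid INTEGER, oid TEXT, version INTEGER, number_of_objects INTEGER)"
    )
    conn.executemany("INSERT INTO object_versions VALUES (?, ?, ?, ?)", ROWS)
    rows = conn.execute(query).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in run(ROW_NUMBER_QUERY):
        print(row)
```

One behavioral difference worth keeping in mind: if two rows in a partition shared the same maximum version, the `max(...) over` variant would return both, while the `row_number()` variant would return only one of them.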

Answer 1: (score: 0)

select      distinct
            first_value(userid) over(
                  partition by oid 
                  order by version desc
                  rows between unbounded preceding and unbounded following
                  ) as userid
            , oid
            , first_value(version) over(
                  partition by oid
                  order by version desc
                  rows between unbounded preceding and unbounded following
                  ) as max_version
            , first_value(number_of_objects) over(
                  partition by oid
                  order by version desc
                  rows between unbounded preceding and unbounded following
                  ) as number_of_objects

from        table
order by    oid;

AWS Redshift Documentation first_value

Don't forget nulls last in the order by if version is nullable.
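To see what the nulls last advice amounts to, here is a minimal sketch under assumptions: SQLite 3.30 or later (through Python's `sqlite3` module) stands in for Redshift, the table name `object_versions` is hypothetical, and the extra row with a NULL version is invented for illustration and is not part of the question's data. Note that SQLite already sorts NULLs last under DESC, so the explicit clause changes nothing there; on Redshift, NULLs sort first under DESC by default, so without it the NULL row would win.

```python
import sqlite3

# Rows from the question plus one invented row whose version is NULL.
ROWS = [
    (1, "ab", 1, 10), (1, "ab", 2, 20), (1, "ab", 3, 17),
    (1, "ab", 4, 16), (1, "ab", 5, 14),
    (1, "cd", 1, 5), (1, "cd", 2, 6), (1, "cd", 3, 9), (1, "cd", 4, 12),
    (2, "ef", 1, 4), (2, "ef", 2, 3),
    (2, "ef", None, 99),  # hypothetical row with an unknown version
    (2, "gh", 1, 16), (2, "gh", 2, 12), (2, "gh", 3, 21),
]

# Same shape as the answer's query, with NULLS LAST added to each ORDER BY.
FIRST_VALUE_QUERY = """
SELECT DISTINCT
       first_value(userid) OVER (
           PARTITION BY oid
           ORDER BY version DESC NULLS LAST
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       ) AS userid,
       oid,
       first_value(version) OVER (
           PARTITION BY oid
           ORDER BY version DESC NULLS LAST
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       ) AS max_version,
       first_value(number_of_objects) OVER (
           PARTITION BY oid
           ORDER BY version DESC NULLS LAST
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       ) AS number_of_objects
FROM object_versions
ORDER BY oid
"""


def latest_by_first_value():
    """Load the rows into an in-memory SQLite table and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE object_versions "
        "(userid INTEGER, oid TEXT, version INTEGER, number_of_objects INTEGER)"
    )
    conn.executemany("INSERT INTO object_versions VALUES (?, ?, ?, ?)", ROWS)
    rows = conn.execute(FIRST_VALUE_QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in latest_by_first_value():
        print(row)
```

With NULLs sorted last, the row for oid ef still reports version 2 with 3 objects rather than the invented NULL-version row.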

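Answer 2: (score: 0)

Long story short: go with the horse, i.e. @a_horse_with_no_name's approach.

The asker's approach should be faster on smaller tables and when pulling a sample of the data, but the window approach is more consistent in performance and will be faster on the whole table.

Below are some explain results from a table of mine that has 17 columns, 184,121,798 rows and 12,809,740 unique ids (14 versions per id on average, but up to 40).

Quick summary:

Tomi's approach: cost = 5983958.76..67801689853856.94 (6 * 10^6 for the first row, 7 * 10^13 for the whole table)

@a_horse_with_no_name's approach: cost = 1000027117538.39..1000031720583.59 (10^12 for any query)

@Merlin: almost exactly the same as the approach above.

Original approach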

explain
SELECT * FROM table t,
(SELECT MAX(version) as maxversion, id
 FROM table
 GROUP BY id
) as maxtable
WHERE  t.id = maxtable.id
       AND t.version = maxtable.maxversion;
XN Hash Join DS_DIST_NONE  (cost=5983958.76..67801689853856.94 rows=63811541 width=590)
  Hash Cond: ((("outer".id)::text = ("inner".id)::text) AND ("outer".version = "inner".maxversion))
  ->  XN Seq Scan on equipment_visits ev  (cost=0.00..1841218.08 rows=184121808 width=418)
  ->  XN Hash  (cost=5063349.72..5063349.72 rows=184121808 width=172)
        ->  XN Subquery Scan maxtable  (cost=2761827.12..5063349.72 rows=184121808 width=172)
              ->  XN HashAggregate  (cost=2761827.12..3222131.64 rows=184121808 width=44)
                    ->  XN Seq Scan on equipment_visits  (cost=0.00..1841218.08 rows=184121808 width=44)

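So the costs for the first row and for all rows are 5983958.76 (6 * 10^6) and 67801689853856.94 (7 * 10^13) respectively.

a_horse_with_no_name's approach

Both solutions provided by @a_horse_with_no_name have almost exactly the same plan, so I'll paste only one of them: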

explain
SELECT * 
FROM (
  select *,
         row_number() over (partition by id order by version desc) as rn
  from table
)
where rn = 1;

gives

  Filter: (rn = 1)
  ->  XN Window  (cost=1000027117538.39..1000029419060.99 rows=184121808 width=44)
        Partition: id
        Order: version
        ->  XN Sort  (cost=1000027117538.39..1000027577842.91 rows=184121808 width=44)
              Sort Key: id, version
              ->  XN Seq Scan on table  (cost=0.00..1841218.08 rows=184121808 width=44)

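Merlin's approach

The solution provided by @Merlin seems incomplete, since it doesn't return all values of the latest version, but its performance is similar to the second option: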

explain
select      distinct
              id
            , first_value(version) over(
                  partition by id
                  order by version desc
                  rows between unbounded preceding and unbounded following
                  ) as max_version
            , first_value(additional_col) over(
                  partition by id
                  order by version desc
                  rows between unbounded preceding and unbounded following
                  ) as additional_col

from        table t;

gives

XN Unique  (cost=1000027117538.39..1000032180888.11 rows=184121808 width=84)
  ->  XN Window  (cost=1000027117538.39..1000030799974.55 rows=184121808 width=84)
        Partition: id
        Order: version
        ->  XN Sort  (cost=1000027117538.39..1000027577842.91 rows=184121808 width=84)
              Sort Key: id, version
              ->  XN Seq Scan on table  (cost=0.00..1841218.08 rows=184121808 width=84)