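Find the movie with the most awards in a given year - code duplication

Time: 2013-11-16 17:41:30

Tags: sql postgresql aggregate-functions window-functions

I'm trying to write a query (PostgreSQL) to get "the movie with the most awards in 2012."

I have the following tables: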

CREATE TABLE Award(
    ID_AWARD bigserial CONSTRAINT Award_pk PRIMARY KEY,
    award_name VARCHAR(90),
    category VARCHAR(90),
    award_year integer,
    CONSTRAINT award_unique UNIQUE (award_name, category, award_year));

CREATE TABLE AwardWinner(
    ID_AWARD integer,
    ID_ACTOR integer,
    ID_MOVIE integer,
    CONSTRAINT AwardWinner_pk PRIMARY KEY (ID_AWARD));

I wrote the following query, which gives the correct result, but I think it contains a lot of code duplication.

select * from 
(select id_movie, count(id_movie) as awards 
from Award natural join awardwinner 
where award_year = 2012 group by id_movie) as SUB
where awards = (select max(count) from 
(select id_movie, count(id_movie) 
from Award natural join awardwinner 
where award_year = 2012 group by id_movie) as SUB2);

So SUB and SUB2 are exactly the same subquery. Is there a better way to do this?

3 answers:

Answer 0: (score: 7)

You can use a common table expression to avoid the code duplication:

with cte_s as (
   select id_movie, count(id_movie) as awards
   from Award natural join awardwinner 
   where award_year = 2012
   group by id_movie
)
select
    sub.id_movie, sub.awards
from cte_s as sub
where sub.awards = (select max(sub2.awards) from cte_s as sub2)

Or you can do something like this with a window function (untested, but I think PostgreSQL allows it):

with cte_s as (
    select
        id_movie,
        count(id_movie) as awards,
        max(count(id_movie)) over() as max_awards
    from Award natural join awardwinner 
    where award_year = 2012
    group by id_movie
)
select id_movie
from cte_s
where max_awards = awards
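
The answer marks this variant as untested. A minimal sketch for checking it on a few rows follows. The temporary tables are cut-down, hypothetical copies of the question's tables (only the columns the query needs) and the sample rows are made up; the transaction is rolled back so nothing persists.

```sql
BEGIN;

-- Hypothetical, cut-down copies of the question's tables (illustration only).
CREATE TEMP TABLE award (id_award int PRIMARY KEY, award_year int);
CREATE TEMP TABLE awardwinner (id_award int PRIMARY KEY, id_movie int);

-- Made-up rows: awards 1-3 are from 2012, award 4 is from 2011.
INSERT INTO award VALUES (1, 2012), (2, 2012), (3, 2012), (4, 2011);
INSERT INTO awardwinner VALUES (1, 10), (2, 10), (3, 20), (4, 20);

-- count() is evaluated per group first; max(...) OVER () then runs over
-- the grouped rows, so every row carries the overall maximum.
WITH cte_s AS (
    SELECT id_movie,
           count(id_movie) AS awards,
           max(count(id_movie)) OVER () AS max_awards
    FROM award NATURAL JOIN awardwinner
    WHERE award_year = 2012
    GROUP BY id_movie
)
SELECT id_movie, awards
FROM cte_s
WHERE max_awards = awards;
-- With the sample rows above this returns movie 10 with 2 awards.

ROLLBACK;
```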

Another way is to use the rank() function (untested; you may need two CTEs instead of one):

with cte_s as (
    select
        id_movie,
        count(id_movie) as awards,
        rank() over(order by count(id_movie) desc) as rnk
    from Award natural join awardwinner 
    where award_year = 2012
    group by id_movie
)
select id_movie
from cte_s
where rnk = 1

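Update: When I wrote this answer, my main goal was to show how to use a CTE to avoid code duplication. In general, it is better to avoid referencing a CTE more than once in a query if possible: the first query uses two table scans (or index seeks), while the second and third use only one, so I should have stated that the latter queries are preferable. Anyway, @Erwin ran these tests in his answer. Just to add to his main points:

  • I'm also against natural join, because it is error-prone. Actually, my main RDBMS is SQL Server, which doesn't support it, so I'm more used to explicit outer/inner join.
  • It's a good habit to always use aliases in your queries, so you can avoid strange results.
  • This may be entirely subjective, but if I use a table only to filter rows from the main table of the query (as in this query, where we only want the awards for 2012 and just filter rows from awardwinner), I prefer not to use join but exists or in instead; it seems more logical to me.

So the final query could be:
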
with cte_s as (
    select
        aw.id_movie,
        count(*) as awards,
        rank() over(order by count(*) desc) as rnk
    from awardwinner as aw
    where
        exists (
            select *
            from award as a
            where a.id_award = aw.id_award and a.award_year = 2012
        )
    group by aw.id_movie
)
select id_movie
from cte_s
where rnk = 1
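
The point about aliases can be illustrated with one well-known surprise; whether this is the exact case the answer has in mind is an assumption. The sketch below uses hypothetical, cut-down temporary tables and made-up rows, and rolls everything back at the end.

```sql
BEGIN;

-- Hypothetical, cut-down copies of the question's tables (illustration only).
CREATE TEMP TABLE award (id_award int PRIMARY KEY, award_year int);
CREATE TEMP TABLE awardwinner (id_award int PRIMARY KEY, id_movie int);
INSERT INTO award VALUES (1, 2012), (2, 2011);
INSERT INTO awardwinner VALUES (1, 10), (2, 20);

-- Mistake: award has no id_movie column, so the unqualified name inside the
-- subquery silently resolves to the outer awardwinner.id_movie. As long as
-- at least one 2012 award exists, the condition holds for every row.
SELECT id_award, id_movie
FROM awardwinner
WHERE id_movie IN (SELECT id_movie FROM award WHERE award_year = 2012);
-- Returns both sample rows, including the 2011 one.

-- With aliases, the same slip (a.id_movie) is rejected with an error saying
-- the column does not exist, instead of quietly returning wrong rows.
-- The intended query, fully qualified:
SELECT aw.id_award, aw.id_movie
FROM awardwinner AS aw
WHERE aw.id_award IN (SELECT a.id_award
                      FROM award AS a
                      WHERE a.award_year = 2012);
-- Returns only the row for award 1 (movie 10).

ROLLBACK;
```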

Answer 1: (score: 2)

Get all winning movies

SELECT id_movie, awards
FROM  (
   SELECT aw.id_movie, count(*) AS awards
         ,rank() OVER (ORDER BY count(aw.id_movie) DESC) AS rnk
   FROM   award       a
   JOIN   awardwinner aw USING (id_award)
   WHERE  a.award_year = 2012
   GROUP  BY aw.id_movie
   ) sub
WHERE  rnk = 1;
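
To see how rank() treats ties, here is a minimal sketch that runs the query above on made-up rows. The temporary tables are hypothetical, cut-down copies of the question's tables, and the transaction is rolled back at the end.

```sql
BEGIN;

-- Hypothetical, cut-down copies of the question's tables (illustration only).
CREATE TEMP TABLE award (id_award int PRIMARY KEY, award_year int);
CREATE TEMP TABLE awardwinner (id_award int PRIMARY KEY, id_movie int);

-- Made-up rows: in 2012, movies 10 and 20 win two awards each, movie 30 one.
INSERT INTO award VALUES
    (1, 2012), (2, 2012), (3, 2012), (4, 2012), (5, 2012), (6, 2011);
INSERT INTO awardwinner VALUES
    (1, 10), (2, 10), (3, 20), (4, 20), (5, 30), (6, 30);

SELECT id_movie, awards
FROM  (
   SELECT aw.id_movie, count(*) AS awards
         ,rank() OVER (ORDER BY count(*) DESC) AS rnk
   FROM   award       a
   JOIN   awardwinner aw USING (id_award)
   WHERE  a.award_year = 2012
   GROUP  BY aw.id_movie
   ) sub
WHERE  rnk = 1;
-- Movies 10 and 20 both get rank 1 (movie 30 gets rank 3),
-- so both tied movies are returned, each with 2 awards.

ROLLBACK;
```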

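Major points

  • This should be simpler and faster than what has been suggested so far. Test with EXPLAIN ANALYZE.

  • There are cases where CTEs are instrumental in avoiding code duplication. But not this time: a subquery does the job just fine and is usually faster.

  • You can run a window function over an aggregate function on the same query level. That's why this works: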

    rank() OVER (ORDER BY count(aw.id_movie) DESC) AS rnk
    
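  • I suggest using explicit column names in the JOIN condition instead of NATURAL JOIN, which is prone to breakage if you later change or add columns in the underlying tables. A JOIN condition with USING is almost as short, but doesn't break as easily.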

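  • As long as id_movie cannot be NULL, count(*) is shorter and slightly faster than count(id_movie) and gives the same result. (The AwardWinner definition shown in the question has only id_award in its pk and does not declare id_movie NOT NULL, so verify this first.)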
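
The NATURAL JOIN breakage described above can be sketched as follows. The temporary tables are hypothetical, cut-down copies of the question's tables, and the added note column is invented purely for illustration; the transaction is rolled back at the end.

```sql
BEGIN;

-- Hypothetical, cut-down copies of the question's tables (illustration only).
CREATE TEMP TABLE award (id_award int PRIMARY KEY, award_year int);
CREATE TEMP TABLE awardwinner (id_award int PRIMARY KEY, id_movie int);
INSERT INTO award VALUES (1, 2012), (2, 2012);
INSERT INTO awardwinner VALUES (1, 10), (2, 10);

-- id_award is the only shared column name, so both forms join on it: 2 rows each.
SELECT count(*) FROM award NATURAL JOIN awardwinner;
SELECT count(*) FROM award JOIN awardwinner USING (id_award);

-- Later, an unrelated column with the same name is added to both tables
-- (the column name "note" is an assumption made up for this sketch).
ALTER TABLE award       ADD COLUMN note text DEFAULT 'award row';
ALTER TABLE awardwinner ADD COLUMN note text DEFAULT 'winner row';

-- NATURAL JOIN now also requires note to match, so it returns 0 rows ...
SELECT count(*) FROM award NATURAL JOIN awardwinner;
-- ... while USING (id_award) still returns 2.
SELECT count(*) FROM award JOIN awardwinner USING (id_award);

ROLLBACK;
```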

Get only one movie

Shorter and faster still, if you only need one winner:

SELECT aw.id_movie, count(*) AS awards
FROM   award       a
JOIN   awardwinner aw USING (id_award)
WHERE  a.award_year = 2012
GROUP  BY 1
ORDER  BY 2 DESC, 1 -- as tie breaker
LIMIT  1

Positional references (1 and 2) are used here as shorthand. I added id_movie to ORDER BY as a tie breaker, in case multiple movies qualify for the win.
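
For readers who prefer names over positions, the same query can be written with explicit references. This is a sketch on made-up rows in hypothetical, cut-down temporary tables, rolled back at the end.

```sql
BEGIN;

-- Hypothetical, cut-down copies of the question's tables (illustration only).
CREATE TEMP TABLE award (id_award int PRIMARY KEY, award_year int);
CREATE TEMP TABLE awardwinner (id_award int PRIMARY KEY, id_movie int);

-- Made-up rows: movies 10 and 20 are tied with two 2012 awards each.
INSERT INTO award VALUES (1, 2012), (2, 2012), (3, 2012), (4, 2012);
INSERT INTO awardwinner VALUES (1, 20), (2, 20), (3, 10), (4, 10);

SELECT aw.id_movie, count(*) AS awards
FROM   award       a
JOIN   awardwinner aw USING (id_award)
WHERE  a.award_year = 2012
GROUP  BY aw.id_movie               -- same as GROUP BY 1
ORDER  BY awards DESC, aw.id_movie  -- same as ORDER BY 2 DESC, 1
LIMIT  1;
-- The tie is broken by the smaller id_movie: movie 10 with 2 awards.

ROLLBACK;
```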

Answer 2: (score: 0)

Don't you just need something like this?

SELECT ID_MOVIE, COUNT(*)
FROM AwardWinner
JOIN Award ON Award.ID_AWARD = AwardWinner.ID_AWARD
WHERE award_year = 2012
GROUP BY ID_MOVIE
ORDER BY COUNT(*) DESC

Or possibly (depending on what you're looking for):

SELECT ID_MOVIE, COUNT(DISTINCT AwardWinner.ID_AWARD)
FROM AwardWinner
JOIN Award ON Award.ID_AWARD = AwardWinner.ID_AWARD
WHERE award_year = 2012
GROUP BY ID_MOVIE
ORDER BY COUNT(DISTINCT AwardWinner.ID_AWARD) DESC
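
The two counts only differ when the same award can appear in more than one AwardWinner row. With the primary key on ID_AWARD shown in the question that cannot happen, so the sketch below assumes a hypothetical variant of the table keyed on (id_award, id_actor), with made-up rows, to show the difference.

```sql
BEGIN;

-- Hypothetical variant of the schema: one row per award and actor.
CREATE TEMP TABLE award (id_award int PRIMARY KEY, award_year int);
CREATE TEMP TABLE awardwinner (
    id_award int,
    id_actor int,
    id_movie int,
    PRIMARY KEY (id_award, id_actor)
);

-- Made-up rows: award 1 is shared by three actors of movie 10;
-- movie 20 wins two separate awards.
INSERT INTO award VALUES (1, 2012), (2, 2012), (3, 2012);
INSERT INTO awardwinner VALUES
    (1, 100, 10), (1, 101, 10), (1, 102, 10),
    (2, 200, 20), (3, 201, 20);

SELECT aw.id_movie,
       count(*)                    AS winner_rows,
       count(DISTINCT aw.id_award) AS distinct_awards
FROM   awardwinner aw
JOIN   award a ON a.id_award = aw.id_award
WHERE  a.award_year = 2012
GROUP  BY aw.id_movie
ORDER  BY distinct_awards DESC;
-- movie 20: 2 rows, 2 distinct awards
-- movie 10: 3 rows, 1 distinct award

ROLLBACK;
```

Ordering by count(*) would put movie 10 first here, which is why the sort expression should match the count you actually want to rank by.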