Question

Another one for the #greatest-n-per-group party!

My previous code:

select count(*)
  from revisions join files on rev_file = file_id
 where rev_parent_id like 0
   and rev_timestamp between '20011231230000' and '20191231225959'
   and file_namespace like 0
   and file_is_redirect like 0

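The problem is that some files have multiple entries with rev_parent_id = 0. I want to count only the entry with the earliest rev_timestamp, but my attempts based on the answers to "SQL select only rows with max value on a column" and "Select Earliest Date and Time from List of Distinct User Sessions" give me ca. 9,000 and 11,000,000. The correct number should be ca. 422,000. Maybe I am failing to join the three tables correctly. Here is one of my attempts (this one returns 9,000):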

select count(r1.rev_file) 
  from revisions r1
  left outer join revisions r2 on (r1.rev_file = r2.rev_file
                              and r1.rev_timestamp < r2.rev_timestamp) 
  join files on r1.rev_file = file_id 
 where r2.rev_file is NULL
   and r1.rev_parent_id like 0 
   and r1.rev_timestamp between '20011231230000' and '20191231225959' 
   and file_namespace like 0
   and file_is_redirect like 0

Table structure:

files
file_id, file_namespace, file_is_redirect
1234, 0, 0
1235, 3, 1
1236, 3, 0

revisions
rev_file, rev_id, rev_parent_id, rev_timestamp
1234, 19, 16, 20170302061522
1234, 16, 0, 20170302061428
1234, 14, 12, 20170302061422
1234, 12, 0, 20170302061237
1235, 21, 18, 20170302061815
1235, 18, 13, 20170302061501
1235, 13, 8, 20170302061355
1235, 8, 3, 20170302061213
1235, 3, 0, 20170302061002
1236, 6, 0, 20170302061014

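file_id = rev_file = ID of the file. file_namespace = the type of the file (something like a MIME type), 0 being plain text. rev_id = ID of the revision. rev_parent_id = ID of the parent revision. rev_timestamp = timestamp of the revision.

The only valid file is 1234. It was deleted and recreated, so it has two rev_parent_id = 0 entries. I want to count a file only if its earlier rev_parent_id = 0 revision falls within the selected time range.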

Answer 1

You should join a subquery that returns the min rev_timestamp for each rev_file:

    select count(*)
    from revisions
    join files on revisions.rev_file = files.file_id
    join (
        -- earliest creation (rev_parent_id = 0) per file
        select rev_file, min(rev_timestamp) min_time
        from revisions
        where rev_parent_id = 0
        group by rev_file
    ) t on t.min_time = revisions.rev_timestamp
       and t.rev_file = revisions.rev_file
    where revisions.rev_parent_id = 0
    and revisions.rev_timestamp between '20011231230000' and '20191231225959'
    and files.file_namespace = 0
    and files.file_is_redirect = 0
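To check the idea against the sample rows from the question, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for the real database (an assumption: the thread does not name the server). Table and column names come from the question; the in-memory database and the names NAIVE, FIRST_ONLY and time_range are illustrative only.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the real server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE files (file_id INTEGER, file_namespace INTEGER, file_is_redirect INTEGER);
    CREATE TABLE revisions (rev_file INTEGER, rev_id INTEGER,
                            rev_parent_id INTEGER, rev_timestamp TEXT);
    INSERT INTO files VALUES (1234, 0, 0), (1235, 3, 1), (1236, 3, 0);
    INSERT INTO revisions VALUES
        (1234, 19, 16, '20170302061522'),
        (1234, 16,  0, '20170302061428'),
        (1234, 14, 12, '20170302061422'),
        (1234, 12,  0, '20170302061237'),
        (1235, 21, 18, '20170302061815'),
        (1235, 18, 13, '20170302061501'),
        (1235, 13,  8, '20170302061355'),
        (1235,  8,  3, '20170302061213'),
        (1235,  3,  0, '20170302061002'),
        (1236,  6,  0, '20170302061014');
""")

# The question's first query: counts every rev_parent_id = 0 row.
NAIVE = """
    SELECT COUNT(*)
      FROM revisions r
      JOIN files f ON r.rev_file = f.file_id
     WHERE r.rev_parent_id = 0
       AND r.rev_timestamp BETWEEN ? AND ?
       AND f.file_namespace = 0
       AND f.file_is_redirect = 0
"""

# Join against the per-file minimum so only the earliest creation survives.
FIRST_ONLY = """
    SELECT COUNT(*)
      FROM revisions r
      JOIN files f ON r.rev_file = f.file_id
      JOIN (SELECT rev_file, MIN(rev_timestamp) AS min_time
              FROM revisions
             WHERE rev_parent_id = 0
             GROUP BY rev_file) t
        ON t.rev_file = r.rev_file AND t.min_time = r.rev_timestamp
     WHERE r.rev_parent_id = 0
       AND r.rev_timestamp BETWEEN ? AND ?
       AND f.file_namespace = 0
       AND f.file_is_redirect = 0
"""

time_range = ("20011231230000", "20191231225959")
naive_count = conn.execute(NAIVE, time_range).fetchone()[0]
first_only_count = conn.execute(FIRST_ONLY, time_range).fetchone()[0]
print(naive_count, first_only_count)  # file 1234 is counted twice, then once
```

On this sample the first query counts file 1234 twice (revisions 12 and 16), while the joined version counts it once.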

Answer 2

First, let's use a subquery to find the earliest timestamp in revisions for each rev_file that meets your criteria.

          SELECT MIN(rev_timestamp) rev_timestamp, rev_file
            FROM revisions
           WHERE rev_parent_id = 0
             AND rev_timestamp between '20011231230000' and '20191231225959'
           GROUP BY rev_file

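This gives you a virtual table containing the earliest timestamp for each file that meets your criteria.

Next, join that table to the other tables, like this: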

SELECT COUNT(*) count
  FROM revisions r1
  JOIN (
          SELECT MIN(rev_timestamp) rev_timestamp, rev_file
            FROM revisions
           WHERE rev_parent_id = 0
             AND rev_timestamp between '20011231230000' and '20191231225959'
           GROUP BY rev_file
       ) rmin ON r1.rev_timestamp = rmin.rev_timestamp
             AND r1.rev_file = rmin.rev_file
  JOIN files f ON r1.rev_file = f.file_id
   and f.file_namespace = 0
   and f.file_is_redirect = 0
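One detail worth checking, offered as an observation rather than a correction: with the range test inside the subquery, MIN() picks the earliest rev_parent_id = 0 revision within the range, not the earliest overall. A file first created before the range and recreated inside it would then still be counted, which may not match what the question asks for. The sketch below compares both placements using Python's sqlite3 module as a stand-in database (an assumption); the three sample rows and the names RANGE_INSIDE and RANGE_OUTSIDE are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE files (file_id INTEGER, file_namespace INTEGER, file_is_redirect INTEGER);
    CREATE TABLE revisions (rev_file INTEGER, rev_id INTEGER,
                            rev_parent_id INTEGER, rev_timestamp TEXT);
    -- Hypothetical rows: file 1 is created before the range and recreated
    -- inside it; file 2 is created inside the range.
    INSERT INTO files VALUES (1, 0, 0), (2, 0, 0);
    INSERT INTO revisions VALUES
        (1, 10, 0, '20010101000000'),
        (1, 11, 0, '20050101000000'),
        (2, 20, 0, '20100101000000');
""")

# Range test inside the subquery: earliest creation *within* the range.
RANGE_INSIDE = """
    SELECT COUNT(*)
      FROM revisions r1
      JOIN (SELECT MIN(rev_timestamp) AS rev_timestamp, rev_file
              FROM revisions
             WHERE rev_parent_id = 0
               AND rev_timestamp BETWEEN :lo AND :hi
             GROUP BY rev_file) rmin
        ON r1.rev_timestamp = rmin.rev_timestamp AND r1.rev_file = rmin.rev_file
      JOIN files f ON r1.rev_file = f.file_id
       AND f.file_namespace = 0 AND f.file_is_redirect = 0
"""

# Range test outside the subquery: earliest creation overall, then filtered.
RANGE_OUTSIDE = """
    SELECT COUNT(*)
      FROM revisions r1
      JOIN (SELECT MIN(rev_timestamp) AS rev_timestamp, rev_file
              FROM revisions
             WHERE rev_parent_id = 0
             GROUP BY rev_file) rmin
        ON r1.rev_timestamp = rmin.rev_timestamp AND r1.rev_file = rmin.rev_file
      JOIN files f ON r1.rev_file = f.file_id
       AND f.file_namespace = 0 AND f.file_is_redirect = 0
     WHERE rmin.rev_timestamp BETWEEN :lo AND :hi
"""

bounds = {"lo": "20011231230000", "hi": "20191231225959"}
inside_count = conn.execute(RANGE_INSIDE, bounds).fetchone()[0]
outside_count = conn.execute(RANGE_OUTSIDE, bounds).fetchone()[0]
print(inside_count, outside_count)  # file 1 is counted only by the first query
```

Which placement is right depends on the intended meaning; the question's wording ("the earlier rev_parent_id = 0 revision falls within the selected time range") suggests filtering after taking the minimum.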

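Pro tip: It is always worth formatting your queries so they are easy to read.

Pro tip: Use COUNT(*) rather than COUNT(col) whenever you can. It is often faster, and it yields the same result unless the col you mention can contain NULL values. That is not the case in the query in your question.

Pro tip: Always qualify your columns in JOIN operations (f.file_is_redirect rather than file_is_redirect). Again, query readability is the motivation. If you are lucky enough to have someone else maintain your code someday, that person will be glad to see it. This is a big part of "professional and enthusiast" programming.

Pro tip: numeric_col LIKE 0 is a performance killer. LIKE is meant for matching text (column LIKE '%verflo%' matches Stack Overflow). When you use LIKE on a numeric column, every value in that column is coerced to a string before the LIKE operator runs on it, which prevents the use of any index on the numeric column.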
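A small sketch of the NULL behaviour mentioned above, using Python's sqlite3 module; the table t and its column col are hypothetical. COUNT(*) counts rows, while COUNT(col) skips rows where col is NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col INTEGER)")  # hypothetical table
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (None,), (3,)])

# COUNT(*) counts all rows; COUNT(col) ignores the row where col IS NULL.
count_star, count_col = conn.execute(
    "SELECT COUNT(*), COUNT(col) FROM t"
).fetchone()
print(count_star, count_col)
```

This matters mostly after an outer join, where unmatched rows carry NULLs in the joined columns.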
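The index effect can be observed with a query plan. The sketch below uses SQLite through Python's sqlite3 module purely as a stand-in (an assumption: other engines word their plans differently and may optimize differently). The table revisions_demo and the index idx_parent are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE revisions_demo (rev_id INTEGER PRIMARY KEY, rev_parent_id INTEGER);
    CREATE INDEX idx_parent ON revisions_demo (rev_parent_id);
    INSERT INTO revisions_demo VALUES (1, 0), (2, 1), (3, 0), (4, 10);
""")

def plan_details(sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

EQ_SQL = "SELECT rev_id FROM revisions_demo WHERE rev_parent_id = 0"
LIKE_SQL = "SELECT rev_id FROM revisions_demo WHERE rev_parent_id LIKE 0"

eq_plan = plan_details(EQ_SQL)      # expected to be an index SEARCH
like_plan = plan_details(LIKE_SQL)  # expected to be a full SCAN

# Both predicates select the same rows here; only the access path differs.
eq_rows = sorted(r[0] for r in conn.execute(EQ_SQL))
like_rows = sorted(r[0] for r in conn.execute(LIKE_SQL))

print(eq_plan)
print(like_plan)
```

In SQLite the equality test can seek into the index, while the LIKE form has to visit every entry and compare it as text.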

Answer 3

Thanks to you both, @scaisedge and @o-jones. In the end I used the core of both answers and removed the redundant code, and this finally worked for me:

select count(*)
  from (select rev_file, min(rev_timestamp) rev_timestamp
          from revisions
         where rev_parent_id = 0
         group by rev_file) r  -- earliest creation per file
  join files on rev_file = file_id
 where rev_timestamp between '20011231230000' and '20191231225959'
   and file_namespace = 0
   and not file_is_redirect;

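Maybe I could also save some run time by moving the file_namespace and file_is_redirect conditions into another subquery in the join, but maybe not. I am not sure.

scaisedge's answer is shorter and more readable, so I understood and liked it immediately. It only had a few errors in the code (which I fixed). o-jones's answer is somewhat cluttered but more thorough, in case any reader needs the explanation, and thanks to its improvement tips I learned about some performance problems in my code.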

Select the first entry with rev_parent_id = 0 from joined tables
