SQLite3 query optimization: join vs. subselect

Date: 2013-06-28 22:16:42

Tags: sql database sqlite query-optimization

I am trying to figure out the best way (it probably does not matter much in this case) to find the rows of one table based on the presence of a flag and a relation id in a row of another table.

Here is the schema:

    CREATE TABLE files (
        id INTEGER PRIMARY KEY,
        dirty INTEGER NOT NULL);

    CREATE TABLE resume_points (
        id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
        scan_file_id INTEGER NOT NULL);

I am using SQLite3.

The files table will be very large, typically 10K-5M rows. resume_points will have fewer than 10K rows, with only 1-2 distinct values of scan_file_id.

So my first thought was:

    select distinct files.* from resume_points inner join files
    on resume_points.scan_file_id=files.id where files.dirty = 1;

A colleague suggested turning it around:

    select distinct files.* from files inner join resume_points
    on files.id=resume_points.scan_file_id where files.dirty = 1;

Then I thought that, since we know the number of distinct scan_file_id values will be small, perhaps a subselect would be optimal (in this rare case):

    select * from files where id in (select distinct scan_file_id from resume_points);

The EXPLAIN output had 42, 42, and 48 rows, respectively.
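
For what it's worth, the row count of raw EXPLAIN output is just the number of bytecode instructions in the prepared statement, not a cost estimate; EXPLAIN QUERY PLAN gives a more readable summary of the plan that was actually chosen. A minimal sketch for the third query (the same applies to the other two):

    EXPLAIN QUERY PLAN
    select * from files where id in (select distinct scan_file_id from resume_points);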

5 Answers:

Answer 0 (score: 11):

TL;DR: The best query and index are:

    create index uniqueFiles on resume_points (scan_file_id);
    select * from (select distinct scan_file_id from resume_points) d
    join files on d.scan_file_id = files.id and files.dirty = 1;

Since I normally use SQL Server, at first I assumed the query optimizer would surely find the optimal execution plan for such a simple query, no matter how you write these equivalent SQL statements. So I downloaded SQLite and started playing around. To my surprise, there was a huge difference in performance.

Here is the setup code:

    CREATE TABLE files (
        id INTEGER PRIMARY KEY autoincrement,
        dirty INTEGER NOT NULL);

    CREATE TABLE resume_points (
        id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
        scan_file_id INTEGER NOT NULL);

    insert into files (dirty) values (0);
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;

    insert into resume_points (scan_file_id) select (select abs(random() % 8000000)) from files limit 5000;
    insert into resume_points (scan_file_id) select (select abs(random() % 8000000)) from files limit 5000;

These are the indexes I considered:

    create index dirtyFiles on files (dirty, id);
    create index uniqueFiles on resume_points (scan_file_id);
    create index fileLookup on files (id);

Here are the queries I tried, along with their execution times on an i5 laptop. The database file size is only about 200MB, since it contains no other data.

    select distinct files.* from resume_points inner join files
    on resume_points.scan_file_id=files.id where files.dirty = 1;

4.3 - 4.5 ms with and without index

    select distinct files.* from files inner join resume_points
    on files.id=resume_points.scan_file_id where files.dirty = 1;

4.4 - 4.7 ms with and without index

    select * from (select distinct scan_file_id from resume_points) d
    join files on d.scan_file_id = files.id and files.dirty = 1;

2.0 - 2.5 ms with uniqueFiles
2.6 - 2.9 ms without uniqueFiles

    select * from files where id in (select distinct scan_file_id from resume_points) and dirty = 1;

2.1 - 2.5 ms with uniqueFiles
2.6 - 3 ms without uniqueFiles

    SELECT f.* FROM resume_points rp INNER JOIN files f on rp.scan_file_id = f.id
    WHERE f.dirty = 1 GROUP BY f.id;

4500 - 6190 ms with uniqueFiles
8.8 - 9.5 ms without uniqueFiles
14000 ms with uniqueFiles and fileLookup

    select * from files where exists (
        select * from resume_points where files.id = resume_points.scan_file_id) and dirty = 1;

8400 ms with uniqueFiles
7400 ms without uniqueFiles
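
A minimal sketch of one way such timings can be taken, assuming the sqlite3 command-line shell is used (its .timer setting prints the elapsed time after each statement); after the setup above, files holds 2^23 = 8,388,608 rows and resume_points holds 10,000:

    .timer on
    -- the winning query: reduce resume_points first, then probe files
    select * from (select distinct scan_file_id from resume_points) d
    join files on d.scan_file_id = files.id and files.dirty = 1;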

It looks like SQLite's query optimizer is not very advanced at all. The best queries first reduce resume_points to a small number of rows (two in the test case; the OP says it will be 1-2) and only then look up each file to see whether it is dirty. The dirtyFiles index did not make much difference for any of the queries. I suspect that is because of how the data is arranged in the test tables; it may make a difference on the production tables. However, the difference will not be large, since there will be only a handful of lookups. uniqueFiles does make a difference, because it can reduce the 10,000 rows of resume_points to 2 rows without scanning most of them. fileLookup made some queries slightly faster, but not enough to change the results significantly; notably, it made the GROUP BY variant very slow. In short, reduce the result set as early as possible to make the biggest difference.

Answer 1 (score: 1):

Since files.id is the primary key, try a GROUP BY on that field instead of checking DISTINCT files.*:

    SELECT f.*
    FROM resume_points rp
    INNER JOIN files f on rp.scan_file_id = f.id
    WHERE f.dirty = 1
    GROUP BY f.id;

Another option to consider for performance is adding an index on resume_points.scan_file_id:

    CREATE INDEX index_resume_points_scan_file_id ON resume_points (scan_file_id);

Answer 2 (score: 1):

You could try exists, which will not produce any duplicate files rows:

    select * from files
    where exists (
        select * from resume_points
        where files.id = resume_points.scan_file_id
    )
    and dirty = 1;

Of course, it may help to have the right indexes in place:

    files.dirty
    resume_points.scan_file_id

Whether the indexes are useful depends on your data.
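
Whether they pay off can be checked rather than guessed: running ANALYZE gives the planner statistics about the tables and indexes (stored in sqlite_stat1), and EXPLAIN QUERY PLAN then shows which index, if any, it decides to use. A minimal sketch along those lines:

    -- gather statistics so the planner can judge the selectivity of each index
    ANALYZE;
    -- inspect the plan chosen for the EXISTS query above
    EXPLAIN QUERY PLAN
    select * from files
    where exists (
        select * from resume_points where files.id = resume_points.scan_file_id)
    and dirty = 1;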

Answer 3 (score: 1):

I think jtseng has given the solution:

    select * from (select distinct scan_file_id from resume_points) d
    join files on d.scan_file_id = files.id and files.dirty = 1;

Basically it is the same as the last option you posted:

    select * from files where id in (select distinct scan_file_id from resume_points) and dirty = 1;

This is because you have to avoid a full table scan/join.

First, you need the 1-2 distinct ids:

    select distinct scan_file_id from resume_points;

After that, only those 1-2 rows have to be joined to the other table instead of all 10K, which is what optimizes the performance.

If you need this statement multiple times, I would put it into a view. The view does not change the performance, but it looks cleaner and is easier to read.
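
A minimal sketch of what such a view could look like (the name dirty_resume_files is only for illustration):

    CREATE VIEW dirty_resume_files AS
    select files.* from (select distinct scan_file_id from resume_points) d
    join files on d.scan_file_id = files.id and files.dirty = 1;

    -- callers then simply read from the view
    select * from dirty_resume_files;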

Also check the query optimization documentation: http://www.sqlite.org/optoverview.html

Answer 4 (score: 0):

If the "resume_points" table has only one or two distinct file id numbers, it seems it needs only one or two rows, and it seems scan_file_id should be its primary key. That table would then have only two columns, and its id number is meaningless.

If that is the case, you do not need any id numbers at all:

    pragma foreign_keys = on;

    CREATE TABLE resume_points (
        scan_file_id integer primary key
    );

    CREATE TABLE files (
        scan_file_id integer not null references resume_points (scan_file_id),
        dirty INTEGER NOT NULL,
        primary key (scan_file_id, dirty)
    );

Now you do not need a join, either. Just query the files table.
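
For illustration, the lookup under that schema might reduce to a single-table query along these lines (a sketch, returning the ids of the dirty files):

    -- no join needed: files carries scan_file_id directly
    select scan_file_id from files where dirty = 1;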