Selecting people with a given surname from a database

Time: 2018-12-31 11:30:27

Tags: sqlite

I run the following to get the population of each district in a given year:

SELECT Year, County, District, COUNT(*) FROM census_data WHERE Year = ? GROUP BY Year, County, District;

Then I run the following several thousand times, to get the population in each district for every surname I'm interested in:

SELECT Year, County, District, COUNT(*) FROM census_data where Year = ? and Surname = ? group by Year, County, District;

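My database has 8 million rows covering two specific years. There are about 40 counties, and a county typically has a few hundred districts.

Should I add an index to the table to speed up the queries above, like this: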

CREATE INDEX surname_index ON census_data (surname);

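My thinking is that, since any given surname generally does not have many people, this index alone should be enough. Or would you recommend something else? I could also change the query to: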

SELECT Year, County, District, COUNT(*) FROM census_data where Surname = ? group by Year, County, District;

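since I'm usually interested in both years. How can I see whether my index is being used when I run the query?

1 answer:

Answer 0 (score: 1)

Yes, I would put an index on the columns you're grouping by. As I mentioned in a comment, I would also use a single query that produces all the rows you need, instead of 1000+ queries that each produce one piece of the total. Make the database do all the work just once. Since you mentioned that the names you're interested in are the 1000 most common ones rather than random names, that actually makes things a bit easier.

The following demonstrates two slightly different ways to get the count for each (year, county, district, surname) among the most common surnames overall.

First, populate a table with some sample data: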

CREATE TABLE census(year INTEGER, county TEXT, district TEXT, surname TEXT);
INSERT INTO census VALUES
       (2012, 'Lake', 'West', 'Smith'),
       (2012, 'Lake', 'West', 'Jones'),
       (2012, 'Lake', 'West', 'Smith'),
       (2012, 'Lake', 'West', 'Washington'),
       (2012, 'Lake', 'West', 'Washington'),
       (2012, 'Lake', 'East', 'Smith'),
       (2012, 'Lake', 'East', 'Jackson'),
       (2012, 'Williams', 'Downtown', 'Jones'),
       (2012, 'Williams', 'Downtown', 'McMaster'),
       (2012, 'Williams', 'West Side', 'Jones'),
       (2012, 'Williams', 'West Side', 'Jones');
CREATE INDEX census_idx ON census(year, county, district, surname);

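(Your real data will of course have many more rows, and probably more columns. Depending on space constraints, you might want to leave surname out of the index, at the cost of slower queries. With all four columns in the index, it is a covering index for the queries below and the actual table rows are never accessed. With only the first three columns (or two, or one), the grouping needs a temporary b-tree and many more table accesses.)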
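On the question of how to tell whether an index is being used: SQLite's EXPLAIN QUERY PLAN reports the strategy chosen for a statement. The following is only a minimal sketch, assuming Python's standard sqlite3 module as the client (the question does not say which client is used) and a cut-down copy of the sample table above. The helper name query_plan is made up for this illustration, and the exact wording of the plan differs between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE census(year INTEGER, county TEXT, district TEXT, surname TEXT);
    INSERT INTO census VALUES
        (2012, 'Lake', 'West', 'Smith'),
        (2012, 'Lake', 'East', 'Jackson'),
        (2012, 'Williams', 'Downtown', 'Jones');
    CREATE INDEX census_idx ON census(year, county, district, surname);
""")

def query_plan(conn, sql, params=()):
    """Return the human-readable steps SQLite plans to take for sql.

    Hypothetical helper for this example, not part of sqlite3.
    """
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The textual description is the last column of each plan row.
    return [row[-1] for row in rows]

plan = query_plan(conn, """
    SELECT year, county, district, surname, count(*)
    FROM census
    GROUP BY year, county, district, surname
""")
for step in plan:
    print(step)
```

A step that mentions USING COVERING INDEX census_idx means the query is answered from the index alone. A step that mentions USE TEMP B-TREE FOR GROUP BY means the grouping has to be sorted separately.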

Method one: populate a temporary table with the 1000 most common names overall, and use it in a join to restrict the results to just those names:

CREATE TEMP TABLE names(name TEXT PRIMARY KEY) WITHOUT ROWID;
INSERT INTO names
 SELECT surname FROM census GROUP BY surname ORDER BY count(*) DESC LIMIT 1000;    
SELECT year, county, district, surname, count(*) as number
FROM census AS c
JOIN names AS n ON c.surname = n.name
GROUP BY year, county, district, surname
ORDER BY year, county, district, count(*) DESC, surname;

Method two: do the same thing, but use a subquery instead of a table to find the most common names:

SELECT year, county, district, surname, count(*) as number
FROM census AS c
JOIN (SELECT surname AS name FROM census GROUP BY surname ORDER BY count(*) DESC LIMIT 1000) AS n ON c.surname = n.name
GROUP BY year, county, district, surname
ORDER BY year, county, district, count(*) DESC, surname;

Both produce:

year        county      district    surname     number    
----------  ----------  ----------  ----------  ----------
2012        Lake        East        Jackson     1         
2012        Lake        East        Smith       1         
2012        Lake        West        Smith       2         
2012        Lake        West        Washington  2         
2012        Lake        West        Jones       1         
2012        Williams    Downtown    Jones       1         
2012        Williams    Downtown    McMaster    1         
2012        Williams    West Side   Jones       2
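To check that the two methods really agree, the statements above can be run from a small script. This is a sketch under assumptions: it uses Python's standard sqlite3 module (not something the question mentions), and the names TAIL, METHOD_ONE and METHOD_TWO are just labels invented for the example. The SQL itself is copied from the two methods.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE census(year INTEGER, county TEXT, district TEXT, surname TEXT);
    INSERT INTO census VALUES
        (2012, 'Lake', 'West', 'Smith'),
        (2012, 'Lake', 'West', 'Jones'),
        (2012, 'Lake', 'West', 'Smith'),
        (2012, 'Lake', 'West', 'Washington'),
        (2012, 'Lake', 'West', 'Washington'),
        (2012, 'Lake', 'East', 'Smith'),
        (2012, 'Lake', 'East', 'Jackson'),
        (2012, 'Williams', 'Downtown', 'Jones'),
        (2012, 'Williams', 'Downtown', 'McMaster'),
        (2012, 'Williams', 'West Side', 'Jones'),
        (2012, 'Williams', 'West Side', 'Jones');
    CREATE INDEX census_idx ON census(year, county, district, surname);

    -- Method one's helper table, built once per session.
    CREATE TEMP TABLE names(name TEXT PRIMARY KEY) WITHOUT ROWID;
    INSERT INTO names
        SELECT surname FROM census GROUP BY surname ORDER BY count(*) DESC LIMIT 1000;
""")

# Shared grouping and ordering clause of both queries.
TAIL = """
    GROUP BY year, county, district, surname
    ORDER BY year, county, district, count(*) DESC, surname
"""
METHOD_ONE = """
    SELECT year, county, district, surname, count(*) AS number
    FROM census AS c
    JOIN names AS n ON c.surname = n.name
""" + TAIL
METHOD_TWO = """
    SELECT year, county, district, surname, count(*) AS number
    FROM census AS c
    JOIN (SELECT surname AS name FROM census
          GROUP BY surname ORDER BY count(*) DESC LIMIT 1000) AS n
      ON c.surname = n.name
""" + TAIL

rows_one = conn.execute(METHOD_ONE).fetchall()
rows_two = conn.execute(METHOD_TWO).fetchall()

print("same result:", rows_one == rows_two)
for row in rows_one:
    print(row)
```

The sample data has only five distinct surnames, so the LIMIT 1000 keeps all of them. With real data, ties at the cutoff could in principle be resolved differently between runs.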

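If you are going to run this query many times in one session, the first method will be faster: it builds the list of most common names only once, whereas the second has to build it every time the query runs. However, it is more involved, since it requires multiple SQL statements. For a single run, of course, benchmarking both on a suitably sized dataset is the best guide.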
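A rough way to run such a benchmark is sketched below, again assuming Python's standard sqlite3 module. Everything about the data is made up for illustration (50,000 rows, 5 counties, 20 districts, 2,000 surnames), and seconds_per_run is a hypothetical helper. Real timings depend on the actual 8 million rows, so treat the printed numbers as a template rather than a result.

```python
import random
import sqlite3
import time

random.seed(0)  # make the synthetic data reproducible
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE census(year INTEGER, county TEXT, district TEXT, surname TEXT)")

# Invented data; taking the min of two draws skews surnames toward low numbers.
people = [
    (random.choice((2011, 2012)),
     f"county{random.randrange(5)}",
     f"district{random.randrange(20)}",
     f"surname{min(random.randrange(2000), random.randrange(2000))}")
    for _ in range(50_000)
]
conn.executemany("INSERT INTO census VALUES (?, ?, ?, ?)", people)
conn.execute("CREATE INDEX census_idx ON census(year, county, district, surname)")

TOP_NAMES = ("SELECT surname AS name FROM census "
             "GROUP BY surname ORDER BY count(*) DESC LIMIT 1000")

# Method one's helper table is built once, outside the timed section.
conn.execute("CREATE TEMP TABLE names(name TEXT PRIMARY KEY) WITHOUT ROWID")
conn.execute("INSERT INTO names " + TOP_NAMES)

QUERY = """
    SELECT year, county, district, surname, count(*) AS number
    FROM census AS c
    JOIN {names} AS n ON c.surname = n.name
    GROUP BY year, county, district, surname
    ORDER BY year, county, district, count(*) DESC, surname
"""
METHOD_ONE = QUERY.format(names="names")
METHOD_TWO = QUERY.format(names="(" + TOP_NAMES + ")")

def seconds_per_run(sql, runs=5):
    """Average wall-clock seconds to execute sql and fetch every row."""
    start = time.perf_counter()
    for _ in range(runs):
        conn.execute(sql).fetchall()
    return (time.perf_counter() - start) / runs

time_one = seconds_per_run(METHOD_ONE)
time_two = seconds_per_run(METHOD_TWO)
print(f"temp table: {time_one:.4f} s/run, subquery: {time_two:.4f} s/run")
```

The timing for method one deliberately leaves out the cost of building the temporary table, which matches the many-runs-per-session case. For a single run, that cost would need to be added back in.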