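I have a PostgreSQL database with a lot of data (currently about 46 GB, and it will keep growing). I created indexes on frequently used columns and adjusted the configuration file: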
shared_buffers = 1GB
temp_buffers = 256MB
work_mem = 512MB
But this query is still slow:
select distinct us_category_id as cat, count(h_user_id) as res from web_hits
inner join users on h_user_id = us_id
where (h_datetime)::date = ('2015-06-26')::date and us_category_id != ''
group by us_category_id
EXPLAIN ANALYZE output:
HashAggregate  (cost=2870958.72..2870958.93 rows=21 width=9) (actual time=899141.683..899141.683 rows=0 loops=1)
  Group Key: users.us_category_id, count(web_hits.h_user_id)
  ->  HashAggregate  (cost=2870958.41..2870958.62 rows=21 width=9) (actual time=899141.681..899141.681 rows=0 loops=1)
        Group Key: users.us_category_id
        ->  Hash Join  (cost=5974.98..2869632.11 rows=265259 width=9) (actual time=899141.679..899141.679 rows=0 loops=1)
              Hash Cond: ((web_hits.h_user_id)::text = (users.us_id)::text)
              ->  Seq Scan on web_hits  (cost=0.00..2857563.80 rows=275260 width=7) (actual time=899141.676..899141.676 rows=0 loops=1)
                    Filter: ((h_datetime)::date = '2015-06-26'::date)
                    Rows Removed by Filter: 55051918
              ->  Hash  (cost=4292.99..4292.99 rows=134559 width=10) (never executed)
                    ->  Seq Scan on users  (cost=0.00..4292.99 rows=134559 width=10) (never executed)
                          Filter: ((us_category_id)::text <> ''::text)
Planning time: 1.309 ms
Execution time: 899141.789 ms
The date in the query has been changed. How can I speed up the query?
Table and index definitions:
CREATE TABLE web_hits (
h_id integer NOT NULL DEFAULT nextval('w_h_seq'::regclass),
h_user_id character varying,
h_datetime timestamp without time zone,
h_db_id character varying,
h_voc_prefix character varying,
...
h_bot_chek integer, -- 1 = bot...
CONSTRAINT w_h_pk PRIMARY KEY (h_id)
);
ALTER TABLE web_hits OWNER TO postgres;
COMMENT ON COLUMN web_hits.h_bot_chek IS '1-бот, 0-не бот';
CREATE INDEX h_datetime ON web_hits (h_datetime);
CREATE INDEX h_db_index ON web_hits (h_db_id COLLATE pg_catalog."default");
CREATE INDEX h_pref_index ON web_hits (h_voc_prefix COLLATE pg_catalog."default" text_pattern_ops);
CREATE INDEX h_user_index ON web_hits (h_user_id text_pattern_ops);
CREATE TABLE users (
us_id character varying NOT NULL,
us_category_id character varying,
...
CONSTRAINT user_pk PRIMARY KEY (us_id),
CONSTRAINT cities_users_fk FOREIGN KEY (us_city_home)
REFERENCES cities (city_id),
CONSTRAINT countries_users_fk FOREIGN KEY (us_country_home)
REFERENCES countries (country_id),
CONSTRAINT organizations_users_fk FOREIGN KEY (us_institution_id)
REFERENCES organizations (org_id),
CONSTRAINT specialities_users_fk FOREIGN KEY (us_speciality_id)
REFERENCES specialities (speciality_id),
CONSTRAINT us_affiliation FOREIGN KEY (us_org_id)
REFERENCES organizations (org_id),
CONSTRAINT us_category FOREIGN KEY (us_category_id)
REFERENCES categories (cat_id),
CONSTRAINT us_reading_room FOREIGN KEY (us_reading_room_id)
REFERENCES reading_rooms (rr_id)
);
ALTER TABLE users OWNER TO sveta;
COMMENT ON COLUMN users.us_type IS '0-аноним, 1-читатель, 2-удаленный';
CREATE INDEX us_cat_index ON users (us_category_id);
CREATE INDEX us_user_index ON users (us_id text_pattern_ops);
Answer 0 (score: 2)
First, the DISTINCT is not needed:
select u.us_category_id as cat, count(h_user_id) as res
from web_hits h inner join
users u
on h.h_user_id = u.us_id
where (h.h_datetime)::date = '2015-06-26'::date and
u.us_category_id <> ''
group by u.us_category_id
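Second, you want to remove the cast on the column, so that the comparison is made against the bare h_datetime column and can use its index. So: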
select u.us_category_id as cat, count(h_user_id) as res
from web_hits h inner join
users u
on h.h_user_id = u.us_id
where (h.h_datetime >= '2015-06-26' and h.h_datetime < '2015-06-27') and
u.us_category_id <> ''
group by u.us_category_id;
Then, the following index should help the query: web_hits(h_datetime, h_user_id). An index on users(us_id, us_category_id) might also be beneficial.
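The two indexes suggested here could be created roughly as follows. This is only a sketch: the index names are made up for illustration, and whether the planner actually picks them depends on your data and statistics.

```sql
-- Index names are illustrative; they do not come from the original post.
-- On a busy production table, CREATE INDEX CONCURRENTLY avoids blocking writes
-- (it is slower and cannot run inside a transaction block).
CREATE INDEX web_hits_datetime_user_idx ON web_hits (h_datetime, h_user_id);
CREATE INDEX users_id_category_idx      ON users (us_id, us_category_id);

-- Refresh planner statistics so the new indexes are costed sensibly.
ANALYZE web_hits;
ANALYZE users;
```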
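Answer 1 (score: 2)
Essential information is missing from the question, so parts of this answer rest on educated guesses.
web_hits.h_user_id is sometimes NULL, as you added in a comment.
Basically, the query can be simplified and improved in any case: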
SELECT u.us_category_id AS cat, count(*) AS res
FROM users u
JOIN web_hits w ON w.h_user_id = u.us_id
WHERE w.h_datetime >= '2015-06-26 0:0'::timestamp
AND w.h_datetime < '2015-06-27 0:0'::timestamp
AND w.h_user_id IS NOT NULL -- remove irrelevant rows, match index
AND u.us_category_id <> ''
GROUP BY 1;
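DISTINCT is obviously redundant, since you already GROUP BY us_category_id (as @Gordon already mentioned).
Make the condition sargable so that an index can be used: compare h_datetime against a timestamp range instead of casting the column to date.
Since you join on the column w.h_user_id, it follows logically that this column is NOT NULL in the result rows. count(*) is equivalent in this case and a bit faster.
The condition h_user_id IS NOT NULL seems redundant, since NULL values are eliminated by the JOIN anyway, but it allows the use of a partial index with a matching condition (see below).
users.us_id (and consequently web_hits.h_user_id) should probably not have the data type varchar (character varying). That is an inefficient data type for PK / FK columns in a huge table. Use a numeric data type like int or bigint (or uuid if you must).
A similar consideration applies to us_category_id: it should be integer or a related type.
The standard SQL inequality operator is <>. Use it instead of the also supported !=.
Use table qualification to avoid ambiguity, and in any case to make the query clear to readers in a public forum.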
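To see the effect of the sargable rewrite, you could compare the plans for the two forms of the date condition. This is a minimal sketch using plain EXPLAIN (so nothing is actually executed); the exact plans depend on your version, statistics and settings.

```sql
-- Cast applied to the column: the plain index on h_datetime does not match
-- this expression, so a sequential scan like the one in the question is likely.
EXPLAIN
SELECT count(*)
FROM   web_hits
WHERE  h_datetime::date = '2015-06-26'::date;

-- Range condition on the bare column: the existing index on h_datetime
-- becomes usable, if the planner considers it cheaper.
EXPLAIN
SELECT count(*)
FROM   web_hits
WHERE  h_datetime >= '2015-06-26'::timestamp
AND    h_datetime <  '2015-06-27'::timestamp;
```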
Assuming further:
users.us_category_id <> '' is true for most rows.
web_hits.h_user_id IS NOT NULL is true for most or all rows.
Then this should be faster still:
SELECT u.us_category_id AS cat, sum(ct) AS res
FROM users u
JOIN (
SELECT h_user_id, count(*) AS ct
FROM web_hits
WHERE h_datetime >= '2015-06-26 0:0'::timestamp
AND h_datetime < '2015-06-27 0:0'::timestamp
AND h_user_id IS NOT NULL -- remove irrelevant rows, match index
GROUP BY 1
) w ON w.h_user_id = u.us_id
AND u.us_category_id <> ''
GROUP BY 1;
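As a sanity check, you could confirm that the basic query and the pre-aggregated variant return the same result before switching. The following is a sketch only: the CTE names and the src label are made up for illustration, and it runs both queries, so expect it to take as long as the two combined. An empty result means both forms agree.

```sql
WITH basic AS (
   SELECT u.us_category_id AS cat, count(*) AS res
   FROM   users u
   JOIN   web_hits w ON w.h_user_id = u.us_id
   WHERE  w.h_datetime >= '2015-06-26 0:0'::timestamp
   AND    w.h_datetime <  '2015-06-27 0:0'::timestamp
   AND    w.h_user_id IS NOT NULL
   AND    u.us_category_id <> ''
   GROUP  BY 1
), preagg AS (
   SELECT u.us_category_id AS cat, sum(w.ct)::bigint AS res  -- cast to match count(*)
   FROM   users u
   JOIN  (
      SELECT h_user_id, count(*) AS ct
      FROM   web_hits
      WHERE  h_datetime >= '2015-06-26 0:0'::timestamp
      AND    h_datetime <  '2015-06-27 0:0'::timestamp
      AND    h_user_id IS NOT NULL
      GROUP  BY 1
      ) w ON w.h_user_id = u.us_id
         AND u.us_category_id <> ''
   GROUP  BY 1
)
-- Rows that appear in one result but not in the other
(SELECT 'only in basic' AS src, * FROM basic
 EXCEPT
 SELECT 'only in basic', * FROM preagg)
UNION ALL
(SELECT 'only in preagg', * FROM preagg
 EXCEPT
 SELECT 'only in preagg', * FROM basic);
```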
Either way, partial indexes should serve your case best:
1
CREATE INDEX wh_usid_datetime_idx ON web_hits(h_user_id, h_datetime)
WHERE h_user_id IS NOT NULL;
This excludes rows where web_hits.h_user_id IS NULL from the index.
The columns go in this order, not the other way round as has been suggested.
2
CREATE INDEX us_usid_cat_not_empty_idx ON users(us_id)
WHERE us_category_id <> '';
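This index will be considerably smaller, since we do not store the possibly lengthy varchar column us_category_id in it, which we do not need here anyway. We only need to know that it is <> ''. If you had an integer column instead, this consideration would not apply.
We also exclude rows with '' or NULL in us_category_id, making the index smaller still.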
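You have to weigh the maintenance cost of special-purpose indexes against their benefit. If you run queries with matching conditions a lot, they will pay off; otherwise they may not, and more general indexes may be better overall.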
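One way to inform that trade-off is to look at how large each index is and how often it has been scanned. A minimal sketch, using the standard statistics view pg_stat_user_indexes; the column aliases are illustrative, and the counters only cover the period since statistics were last reset.

```sql
-- Size and usage of all indexes on the two tables involved
SELECT indexrelname                                 AS index_name,
       idx_scan                                     AS scans_since_stats_reset,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM   pg_stat_user_indexes
WHERE  relname IN ('web_hits', 'users')
ORDER  BY pg_relation_size(indexrelid) DESC;
```

A large index with few or no scans is a candidate for dropping, while a small partial index that is scanned often is probably earning its maintenance cost.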
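Of course, all the usual advice on performance optimization applies as well.
Frankly, there is not much right with your query, and many items in your setup look suspicious. Working with big tables like yours, you might consider professional help.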