PostgreSQL performance with a large data set in an array

Time: 2015-01-14 14:35:03

Tags: performance postgresql

PostgreSQL server 9.1

banners (40K rows) and events (140M rows) are tables with client data. The client_id column is the client's integer ID, and it is indexed.

First query:

SELECT DISTINCT client_id
FROM events 
WHERE type = 'banner_show' AND client_id IN (select distinct client_id from banners)

It runs in about 23 seconds.

Second query:

SELECT DISTINCT client_id
FROM events 
WHERE type = 'banner_show' AND client_id IN (1, 2, 3, 4...)

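Here "1, 2, 3, 4..." is the result of the query "select distinct client_id from banners". The second query ran for about 10 minutes before I stopped it. Why is there such a significant performance difference between two queries over the same data?

EXPLAIN ANALYZE (first query):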

EXPLAIN ANALYZE
SELECT DISTINCT client_id 
FROM events 
WHERE type = 'banner_show' AND client_id IN (select distinct client_id from banners)

"HashAggregate  (cost=4481767.32..4481767.74 rows=42 width=4) (actual        time=24726.275..24727.259 rows=8572 loops=1)"
"  ->  Hash Join  (cost=1954.16..4481542.58 rows=89895 width=4) (actual time=16052.849..24698.907 rows=68770 loops=1)"
"        Hash Cond: (events.client_id = banners.client_id)"
"        ->  Seq Scan on events  (cost=0.00..4476744.47 rows=179790 width=4) (actual time=16037.562..24634.461 rows=69272 loops=1)"
"              Filter: ((type)::text = 'banner_show'::text)"
"        ->  Hash  (cost=1767.58..1767.58 rows=14926 width=4) (actual time=15.258..15.258 rows=13923 loops=1)"
"              Buckets: 2048  Batches: 1  Memory Usage: 490kB"
"              ->  HashAggregate  (cost=1469.06..1618.32 rows=14926 width=4) (actual time=12.421..13.805 rows=13923 loops=1)"
"                    ->  Seq Scan on banners  (cost=0.00..1369.45 rows=39845 width=4) (actual time=0.005..6.883 rows=38184 loops=1)"
"Total runtime: 24727.909 ms"

EXPLAIN ANALYZE (second query):

"HashAggregate  (cost=842924414.03..842924414.17 rows=14 width=4) (actual time=1521873.754..1521874.796 rows=8574 loops=1)    "
"  ->  Bitmap Heap Scan on events  (cost=534167.70..842924261.77 rows=60905 width=4) (actual time=260305.233..1521811.644 rows=68782 loops=1)    "
"        Recheck Cond: (client_id = ANY ('{153566,171259,151232,155132,160170,162720,152159,166302,175899,158611,}'::integer[]))    "
"        Filter: ((type)::text = 'banner_show'::text)    "
"        ->  Bitmap Index Scan on ix_events_client_id  (cost=0.00..534152.47 rows=48209684 width=0) (actual time=4916.828..4916.828 rows=5345417 loops=1)    "
"              Index Cond: (client_id = ANY ('{153566,171259,151232,155132,......}'::integer[]))    "
"Total runtime: 1521875.137 ms    "

Table schemas:

CREATE TABLE banners
(
  id serial NOT NULL,
  type_id integer,
  form_id integer,
  banner character varying,
  client_id integer,
  created timestamp without time zone,
  deleted timestamp without time zone,
  CONSTRAINT banners_pkey PRIMARY KEY (id)
)
WITH (
  OIDS=FALSE
);
ALTER TABLE banners
  OWNER TO postgres;

CREATE INDEX ix_banners_client_id
  ON banners
  USING btree
  (client_id);


CREATE TABLE events
(
  id serial NOT NULL,
  time_created timestamp without time zone,
  type character varying,
  date timestamp without time zone,
  param character varying,
  client_id integer,
  hash_id character varying,
  CONSTRAINT events_pkey PRIMARY KEY (id)
)
WITH (
  OIDS=FALSE
);
ALTER TABLE events
  OWNER TO postgres;

CREATE INDEX ix_events_client_id
  ON events
  USING btree
  (client_id);

CREATE INDEX ix_events_hash_id
  ON events
  USING btree
  (hash_id COLLATE pg_catalog."default");

1 answer:

Answer 0: (score: 0)

When your filter condition involves two columns, you should create an index that covers both of them, for example:

CREATE INDEX event_client_show_idx 
ON events
USING btree (client_id, type);
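
If the queries always filter on the same literal type, a partial index may be a smaller alternative to the two-column index: the first plan in the question shows only 69,272 of the events rows having type 'banner_show'. The following is a minimal sketch under that assumption. The index name is hypothetical, and whether the planner actually picks the index should be checked with EXPLAIN.

```sql
-- Hypothetical index name; assumes every query of interest filters on
-- type = 'banner_show', so only those rows need to be indexed.
-- CONCURRENTLY builds the index without blocking writes to events,
-- but it cannot be run inside a transaction block.
CREATE INDEX CONCURRENTLY ix_events_banner_show_client_id
  ON events
  USING btree (client_id)
  WHERE type = 'banner_show';
```

A query can use this index only if its WHERE clause implies the index predicate, so it does not help for other values of type.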

The select with EXPLAIN:

EXPLAIN
SELECT DISTINCT client_id
FROM events 
WHERE client_id IN (1, 2, 3, 4) AND type = 'banner_show';

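This returns something like the following (the Index Only Scan node requires PostgreSQL 9.2 or later; on the 9.1 server from the question, expect an ordinary index scan or bitmap scan on the same index):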

Unique  (cost=0.15..8.65 rows=1 width=4)
  ->  Index Only Scan using event_client_show_idx on events  (cost=0.15..8.65 rows=1 width=4)
        Index Cond: ((client_id = ANY ('{1,2,3,4}'::integer[])) AND (type = 'banner_show'::text))
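
Separately from indexing, the literal list can be written as a VALUES list instead of an IN (...) array, so that the planner sees a small row set it can join against, much like the subquery in the first query. This is only a sketch: the ids 1 to 4 stand in for the real client_id values, the alias ids is made up for illustration, and the plan the server chooses may differ, so it is worth comparing with EXPLAIN ANALYZE.

```sql
-- Sketch: the id list is supplied as a VALUES row set rather than an array.
-- The ids 1..4 are placeholders for the real client_id values;
-- the alias "ids" is hypothetical.
SELECT DISTINCT e.client_id
FROM events e
JOIN (VALUES (1), (2), (3), (4)) AS ids (client_id)
  ON e.client_id = ids.client_id
WHERE e.type = 'banner_show';
```

DISTINCT keeps the result free of duplicates even if the same id appears more than once in the VALUES list.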

Read more about indexes on Markus Winand's blog: http://use-the-index-luke.com/