在Postgres中对多个字段进行松散的索引扫描?

时间:2013-12-05 05:50:49

标签: postgresql group-by indexing distinct postgresql-9.2

我在Postgres 9.2(数百万行)中有几个大表,我需要根据两个字段'source'(varchar)和'id'(int)的组合生成一个唯一的代码。我可以通过生成row_numbers来执行此操作:

SELECT source,id FROM tablename GROUP BY source,id

但结果可能需要一段时间才能完成。建议如果字段被索引,并且索引值的比例很小(这是我的情况),那么松散的索引扫描可能是更好的选择:http://wiki.postgresql.org/wiki/Loose_indexscan

WITH RECURSIVE
     t AS (SELECT min(col) AS col FROM tablename
           UNION ALL
           SELECT (SELECT min(col) FROM tablename WHERE col > t.col) FROM t WHERE t.col IS NOT NULL)
SELECT col FROM t WHERE col IS NOT NULL
UNION ALL
SELECT NULL WHERE EXISTS(SELECT * FROM tablename WHERE col IS NULL);

该示例在单个字段上运行。尝试返回多个字段会产生错误:子查询必须只返回一列。一种可能性是尝试检索整个ROW - 例如SELECT ROW(min(source),min(id)...,但后来我不确定WHERE语句的语法需要看起来像处理单个行元素。

问题是:是否可以修改基于递归的代码以使用多个列,如果是,如何?我承诺使用Postgres,但看起来MySQL已经为多个列实现了松散的索引扫描:http://dev.mysql.com/doc/refman/5.1/en/group-by-optimization.html


根据建议,我正在附上我的EXPLAIN ANALYZE结果。

对于我的情况 - 我使用GROUP BY 为2列选择不同的值,它是以下内容:

 HashAggregate  (cost=1645408.44..1654099.65 rows=869121 width=34) (actual time=35411.889..36008.475 rows=1233080 loops=1)
   ->  Seq Scan on tablename  (cost=0.00..1535284.96 rows=22024696 width=34) (actual time=4413.311..25450.840 rows=22025768 loops=1)
 Total runtime: 36127.789 ms
(3 rows)

我不知道如何进行2列索引扫描(这是问题),但为了进行比较,在一列上使用GROUP BY,我得到:

 HashAggregate  (cost=1590346.70..1590347.69 rows=99 width=8) (actual time=32310.706..32310.722 rows=100 loops=1)
   ->  Seq Scan on tablename  (cost=0.00..1535284.96 rows=22024696 width=8) (actual time=4764.609..26941.832 rows=22025768 loops=1)
 Total runtime: 32350.899 ms
(3 rows)

但是对于一列上的松散索引扫描,我得到:

 Result  (cost=181.28..198.07 rows=101 width=8) (actual time=0.069..1.935 rows=100 loops=1)
   CTE t
     ->  Recursive Union  (cost=1.74..181.28 rows=101 width=8) (actual time=0.062..1.855 rows=101 loops=1)
           ->  Result  (cost=1.74..1.75 rows=1 width=0) (actual time=0.061..0.061 rows=1 loops=1)
                 InitPlan 1 (returns $1)
                   ->  Limit  (cost=0.00..1.74 rows=1 width=8) (actual time=0.057..0.057 rows=1 loops=1)
                         ->  Index Only Scan using tablename_id on tablename  (cost=0.00..38379014.12 rows=22024696 width=8) (actual time=0.055..0.055 rows=1 loops=1)
                               Index Cond: (id IS NOT NULL)
                               Heap Fetches: 0
           ->  WorkTable Scan on t  (cost=0.00..17.75 rows=10 width=8) (actual time=0.017..0.017 rows=1 loops=101)
                 Filter: (id IS NOT NULL)
                 Rows Removed by Filter: 0
                 SubPlan 3
                   ->  Result  (cost=1.75..1.76 rows=1 width=0) (actual time=0.016..0.016 rows=1 loops=100)
                         InitPlan 2 (returns $3)
                           ->  Limit  (cost=0.00..1.75 rows=1 width=8) (actual time=0.016..0.016 rows=1 loops=100)
                                 ->  Index Only Scan using tablename_id on tablename  (cost=0.00..12811462.41 rows=7341565 width=8) (actual time=0.015..0.015 rows=1 loops=100)
                                       Index Cond: ((id IS NOT NULL) AND (id > t.id))
                                       Heap Fetches: 0
   ->  Append  (cost=0.00..16.79 rows=101 width=8) (actual time=0.067..1.918 rows=100 loops=1)
         ->  CTE Scan on t  (cost=0.00..2.02 rows=100 width=8) (actual time=0.067..1.899 rows=100 loops=1)
               Filter: (id IS NOT NULL)
               Rows Removed by Filter: 1
         ->  Result  (cost=13.75..13.76 rows=1 width=0) (actual time=0.002..0.002 rows=0 loops=1)
               One-Time Filter: $5
               InitPlan 5 (returns $5)
                 ->  Index Only Scan using tablename_id on tablename  (cost=0.00..13.75 rows=1 width=0) (actual time=0.002..0.002 rows=0 loops=1)
                       Index Cond: (id IS NULL)
                       Heap Fetches: 0
 Total runtime: 2.040 ms

完整的表定义如下所示:

CREATE TABLE tablename
(
  source character(25),
  id bigint NOT NULL,
  time_ timestamp without time zone,
  height numeric,
  lon numeric,
  lat numeric,
  distance numeric,
  status character(3),
  geom geometry(PointZ,4326),
  relid bigint
)
WITH (
  OIDS=FALSE
);

CREATE INDEX tablename_height
  ON public.tablename
  USING btree
  (height);

CREATE INDEX tablename_geom
  ON public.tablename
  USING gist
  (geom);


CREATE INDEX tablename_id
  ON public.tablename
  USING btree
  (id);

CREATE INDEX tablename_lat
  ON public.tablename
  USING btree
  (lat);

CREATE INDEX tablename_lon
  ON public.tablename
  USING btree
  (lon);

CREATE INDEX tablename_relid
  ON public.tablename
  USING btree
  (relid);

CREATE INDEX tablename_sid
  ON public.tablename
  USING btree
  (source COLLATE pg_catalog."default", id);

CREATE INDEX tablename_source
  ON public.tablename
  USING btree
  (source COLLATE pg_catalog."default");

CREATE INDEX tablename_time
  ON public.tablename
  USING btree
  (time_);

答案选择:

我花了一些时间来比较所提供的方法。有时这样我希望不止一个答案可以被接受,但在这种情况下,我正在给@jjanes打勾。这样做的原因是他的解决方案与最初提出的问题更紧密地匹配,并且我能够获得关于所需WHERE语句的形式的一些见解。最后,HashAggregate实际上是最快的方法(对我来说),但这是由于我的数据的性质,而不是算法的任何问题。我已经为下面的不同方法添加了EXPLAIN ANALYZE,并且将为jjanes和joop提供+1。

HashAggregate:

 HashAggregate  (cost=1018669.72..1029722.08 rows=1105236 width=34) (actual time=24164.735..24686.394 rows=1233080 loops=1)
   ->  Seq Scan on tablename  (cost=0.00..908548.48 rows=22024248 width=34) (actual time=0.054..14639.931 rows=22024982 loops=1)
 Total runtime: 24787.292 ms

索引扫描修改松散

CTE Scan on t  (cost=13.84..15.86 rows=100 width=112) (actual time=0.916..250311.164 rows=1233080 loops=1)
   Filter: (source IS NOT NULL)
   Rows Removed by Filter: 1
   CTE t
     ->  Recursive Union  (cost=0.00..13.84 rows=101 width=112) (actual time=0.911..249295.872 rows=1233081 loops=1)
           ->  Limit  (cost=0.00..0.04 rows=1 width=34) (actual time=0.910..0.911 rows=1 loops=1)
                 ->  Index Only Scan using tablename_sid on tablename  (cost=0.00..965442.32 rows=22024248 width=34) (actual time=0.908..0.908 rows=1 loops=1)
                       Heap Fetches: 0
           ->  WorkTable Scan on t  (cost=0.00..1.18 rows=10 width=112) (actual time=0.201..0.201 rows=1 loops=1233081)
                 Filter: (source IS NOT NULL)
                 Rows Removed by Filter: 0
                 SubPlan 1
                   ->  Limit  (cost=0.00..0.05 rows=1 width=34) (actual time=0.100..0.100 rows=1 loops=1233080)
                         ->  Index Only Scan using tablename_sid on tablename  (cost=0.00..340173.38 rows=7341416 width=34) (actual time=0.100..0.100 rows=1 loops=1233080)
                               Index Cond: (ROW(source, id) > ROW(t.source, t.id))
                               Heap Fetches: 0
                 SubPlan 2
                   ->  Limit  (cost=0.00..0.05 rows=1 width=34) (actual time=0.099..0.099 rows=1 loops=1233080)
                         ->  Index Only Scan using tablename_sid on tablename  (cost=0.00..340173.38 rows=7341416 width=34) (actual time=0.098..0.098 rows=1 loops=1233080)
                               Index Cond: (ROW(source, id) > ROW(t.source, t.id))
                               Heap Fetches: 0
 Total runtime: 250491.559 ms

合并反加入

Merge Anti Join  (cost=0.00..12099015.26 rows=14682832 width=42) (actual time=48.710..541624.677 rows=1233080 loops=1)
   Merge Cond: ((src.source = nx.source) AND (src.id = nx.id))
   Join Filter: (nx.time_ > src.time_)
   Rows Removed by Join Filter: 363464177
   ->  Index Only Scan using tablename_pkey on tablename src  (cost=0.00..1060195.27 rows=22024248 width=42) (actual time=48.566..5064.551 rows=22024982 loops=1)
         Heap Fetches: 0
   ->  Materialize  (cost=0.00..1115255.89 rows=22024248 width=42) (actual time=0.011..40551.997 rows=363464177 loops=1)
         ->  Index Only Scan using tablename_pkey on tablename nx  (cost=0.00..1060195.27 rows=22024248 width=42) (actual time=0.008..8258.890 rows=22024982 loops=1)
               Heap Fetches: 0
 Total runtime: 541750.026 ms

2 个答案:

答案 0 :(得分:2)

相当可怕,但这似乎有效:

WITH RECURSIVE
     t AS (
  select a,b from (select a,b from foo order by a,b limit 1) asdf union all 
  select (select a from foo where (a,b) > (t.a,t.b) order by a,b limit 1),
         (select b from foo where (a,b) > (t.a,t.b) order by a,b limit 1) 
     from t where t.a is not null)  
select * from t where t.a is not null;

我真的不明白为什么需要“不为空”,因为空值首先来自何处?

答案 1 :(得分:1)

DROP SCHEMA zooi CASCADE;
CREATE SCHEMA zooi ;
SET search_path=zooi,public,pg_catalog;

CREATE TABLE tablename
  ( source character(25) NOT NULL
  , id bigint NOT NULL
  , time_ timestamp without time zone NOT NULL
  , height numeric
  , lon numeric
  , lat numeric
  , distance numeric
  , status character(3)
  , geom geometry(PointZ,4326)
  , relid bigint
        , PRIMARY KEY (source,id,time_) -- <<-- Primary key here
  ) WITH ( OIDS=FALSE);


  -- invent some bogus data
INSERT INTO tablename(source,id,time_)
SELECT 'SRC_'|| (gs%10)::text
        ,gs/10
        ,gt
FROM generate_series(1,1000) gs
        , generate_series('2013-12-01', '2013-12-07', '1hour'::interval) gt
        ;

为两个关键字段选择唯一值:

VACUUM ANALYZE tablename;

EXPLAIN ANALYZE
SELECT source,id,time_
FROM tablename src
WHERE NOT EXISTS (
        SELECT * FROM tablename nx
        WHERE nx.source =src.source
        AND nx.id = src.id
        AND time_  > src.time_
        )
        ;

在此处生成此计划(Pg-9.3):

                                                     QUERY PLAN                                                            
----------------------------------------------------------------------------------------------------------------------------------
 Hash Anti Join  (cost=4981.00..12837.82 rows=96667 width=42) (actual time=547.218..1194.335 rows=1000 loops=1)
   Hash Cond: ((src.source = nx.source) AND (src.id = nx.id))
   Join Filter: (nx.time_ > src.time_)
   Rows Removed by Join Filter: 145000
   ->  Seq Scan on tablename src  (cost=0.00..2806.00 rows=145000 width=42) (actual time=0.010..210.810 rows=145000 loops=1)
   ->  Hash  (cost=2806.00..2806.00 rows=145000 width=42) (actual time=546.497..546.497 rows=145000 loops=1)
         Buckets: 16384  Batches: 1  Memory Usage: 9063kB
         ->  Seq Scan on tablename nx  (cost=0.00..2806.00 rows=145000 width=42) (actual time=0.006..259.864 rows=145000 loops=1)
 Total runtime: 1197.374 ms
(9 rows)

一旦数据超出work_mem,哈希联接可能会消失:

 Merge Anti Join  (cost=0.83..8779.56 rows=29832 width=120) (actual time=0.981..2508.912 rows=1000 loops=1)
   Merge Cond: ((src.source = nx.source) AND (src.id = nx.id))
   Join Filter: (nx.time_ > src.time_)
   Rows Removed by Join Filter: 184051
   ->  Index Scan using tablename_sid on tablename src  (cost=0.41..4061.57 rows=32544 width=120) (actual time=0.055..250.621 rows=145000 loops=1)
   ->  Index Scan using tablename_sid on tablename nx  (cost=0.41..4061.57 rows=32544 width=120) (actual time=0.008..603.403 rows=328906 loops=1)
 Total runtime: 2510.505 ms