Optimizing SELECT DISTINCT

Asked: 2014-04-11 09:04:56

Tags: sql postgresql query-optimization

Question:

Here is some sample data:

cats=# select * from cats limit 8;
   id    | color |  breed  
---------+-------+---------
 4380929 | grey  | persian
 4380930 | grey  | siese
 4380931 | white | persian
 4380932 | white | siamese
 4380933 | grey  | persian
 4380934 | grey  | siese
 4380935 | white | persian
 4380936 | white | siamese
(8 rows)

Here is how to build the database:

psql postgres postgres -c "CREATE DATABASE cats;"
psql cats postgres -c 'CREATE SEQUENCE cat_id_seq;'
psql cats postgres -c "CREATE TABLE cats (id BIGINT NOT NULL default nextval('cat_id_seq'), color text, breed text);"
bash -c 'for i in `seq 1 1000000` ; do echo -e "white\tpersian\nwhite\tsiamese\ngrey\tpersian\ngrey\tsiese"; done;' > /tmp/cats.sql
psql cats postgres -c "COPY cats (color, breed) FROM '/tmp/cats.sql'"

Here is the query:

psql cats postgres -c "select distinct((color,breed)) from cats;"

Running this query takes:

 Unique  (cost=783138.21..805138.22 rows=6 width=12) (actual time=69816.259..81338.631 rows=5 loops=1)
   ->  Sort  (cost=783138.21..794138.22 rows=4400001 width=12) (actual time=69816.258..80412.546 rows=4400001 loops=1)
     Sort Key: (ROW(color, breed))
     Sort Method: external merge  Disk: 189456kB
     ->  Seq Scan on cats  (cost=0.00..72026.01 rows=4400001 width=12) (actual time=0.013..846.713 rows=4400001 loops=1)
 Total runtime: 81363.373 ms
(6 rows)

Output:

(grey,persian)
(grey,siamese)
(grey,siese)
(white,persian)
(white,siamese)
(5 rows)

Any idea how I can make this faster?

The following approach works, but only for a single attribute, not for two: http://zogovic.com/post/44856908222/optimizing-postgresql-query-for-distinct-values

I think I need an index on '(color, breed)' and then:

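  • create a temp table TEMP (color, breed)
  • insert into TEMP (select (color, breed) from cats where (color, breed) not in TEMP)
  • repeat until there is nothing more to insert...
  • select * from TEMP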

But I don't know how to write this in Postgres (without a lot of back-and-forth). Should I use RECURSIVE? Or plpgsql?

Thanks!

2 answers:

Answer 0 (score: 1)

So, after a lot of work, here is the solution:

First, create an index:

create index ON cats (color,breed);
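
The recursive query further down depends on cheap ordered lookups on (color, breed). A minimal sketch of how one might check that the planner can use the new index for such a lookup is shown here. It assumes the table and index are exactly as created above; the resulting plan is not shown because it depends on the PostgreSQL version and settings.

```sql
-- Refresh planner statistics after the bulk load and index creation.
ANALYZE cats;

-- If the index is usable, the plan should start from an index scan on
-- (color, breed) rather than a sort over a sequential scan.
EXPLAIN
SELECT color, breed
FROM cats
ORDER BY color, breed
LIMIT 1;
```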

Next, as a baseline, the simple query:

cats=# select distinct color,breed from cats;
       row       
-----------------
 (a,b)
 (c,d)
 (grey,persian)
 (grey,siamese)
 (grey,siese)
 (white,persian)
 (white,siamese)
(7 rows)

Time: 853.550 ms

Now the version you want to use:

WITH RECURSIVE distinct_pairs AS (
    (
        SELECT c as cl FROM cats c where color IS NOT NULL AND breed IS NOT NULL order by c.color,c.breed LIMIT 1
    )
    UNION ALL
    SELECT (
        SELECT c
        FROM cats c
        WHERE
            (c.color,c.breed) > ((p.cl).color,(p.cl).breed)
        ORDER BY c.color,c.breed LIMIT 1
    )
    FROM distinct_pairs p
    WHERE (p.cl).id IS NOT NULL
) SELECT * FROM distinct_pairs p WHERE (p.cl).id IS NOT NULL;
         cl          
---------------------
 (4400007,a,b)
 (5,grey,persian)
 (6,grey,siamese)
 (400006,grey,siese)
 (2,white,persian)
 (4,white,siamese)
(6 rows)

Time: 0.646 ms

About 1300 times faster (853.550 ms vs. 0.646 ms). Not bad.
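
The query above emulates a loose index scan: it fetches the smallest (color, breed) pair, then repeatedly jumps through the index to the next strictly greater pair, so the work is proportional to the number of distinct pairs rather than the number of rows. The same idea can be sketched with plain columns instead of whole-row values. This variant is an illustration, not the answer's original query; it assumes the (color, breed) index from above and PostgreSQL 9.3 or later for LATERAL, and like the original it skips pairs that contain NULL.

```sql
WITH RECURSIVE distinct_pairs AS (
    (
        -- Seed: the smallest non-NULL (color, breed) pair in index order.
        SELECT color, breed
        FROM cats
        WHERE color IS NOT NULL AND breed IS NOT NULL
        ORDER BY color, breed
        LIMIT 1
    )
    UNION ALL
    -- Step: for the pair found last, look up the next strictly greater pair.
    -- When the lateral subquery returns no row, the recursion stops.
    SELECT n.color, n.breed
    FROM distinct_pairs p
    CROSS JOIN LATERAL (
        SELECT c.color, c.breed
        FROM cats c
        WHERE (c.color, c.breed) > (p.color, p.breed)
        ORDER BY c.color, c.breed
        LIMIT 1
    ) n
)
SELECT color, breed
FROM distinct_pairs;
```

Because the recursive step yields no row once the last pair is reached, this form needs neither the IS NOT NULL guard on the recursive term nor the final filter.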

Thanks to:

Answer 1 (score: 0)

Why do you have the parentheses? Do you know what they do, and do you need them?

If I drop them, I get a speedup of about 20x:

select distinct color, breed from cats;

Wrapping the columns into a record, and then unpacking that record for every sort comparison, is a lot of work.
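
To see the difference on your own data, the two forms can be compared side by side. This is only a sketch: the GROUP BY variant is an addition for comparison and is not part of the original answer, and the plans and timings will vary with the PostgreSQL version, work_mem, and available indexes.

```sql
-- Row constructor: DISTINCT over a single composite value per row.
EXPLAIN ANALYZE
SELECT DISTINCT (color, breed) FROM cats;

-- Plain column list: DISTINCT over two ordinary text columns.
EXPLAIN ANALYZE
SELECT DISTINCT color, breed FROM cats;

-- Logically equivalent GROUP BY form, shown for comparison only.
EXPLAIN ANALYZE
SELECT color, breed FROM cats GROUP BY color, breed;
```

In the first plan the sort key appears as ROW(color, breed), as in the plan quoted in the question; in the other two the planner works on the columns directly.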