Question

The problem is as follows.

Here is some sample data:

cats=# select * from cats limit 8;
   id    | color |  breed  
---------+-------+---------
 4380929 | grey  | persian
 4380930 | grey  | siese
 4380931 | white | persian
 4380932 | white | siamese
 4380933 | grey  | persian
 4380934 | grey  | siese
 4380935 | white | persian
 4380936 | white | siamese
(8 rows)

Here is how to build the database:

psql postgres postgres -c "CREATE DATABASE cats;"
psql cats postgres -c 'CREATE SEQUENCE cat_id_seq;'
psql cats postgres -c "CREATE TABLE cats (id BIGINT NOT NULL default nextval('cat_id_seq'), color text, breed text);"
bash -c 'for i in `seq 1 1000000` ; do echo -e "white\tpersian\nwhite\tsiamese\ngrey\tpersian\ngrey\tsiese"; done;' > /tmp/cats.sql
psql cats postgres -c "COPY cats (color, breed) FROM '/tmp/cats.sql'"
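As an alternative sketch, the same rows can be generated inside PostgreSQL with generate_series, which avoids the temporary file. This assumes the cats table above already exists. The aliases g, i and v are only illustrative, and the insertion order (and therefore the ids) may differ from the file-based load.

```sql
-- Sketch: build 4,000,000 rows (1,000,000 copies of the four pairs)
-- without going through /tmp/cats.sql.
INSERT INTO cats (color, breed)
SELECT v.color, v.breed
FROM generate_series(1, 1000000) AS g(i)
CROSS JOIN (VALUES
    ('white', 'persian'),
    ('white', 'siamese'),
    ('grey',  'persian'),
    ('grey',  'siese')
) AS v(color, breed);
```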

Here is the query:

psql cats postgres -c "select distinct((color,breed)) from cats;"

Running this query takes about 81 seconds:

 Unique  (cost=783138.21..805138.22 rows=6 width=12) (actual time=69816.259..81338.631 rows=5 loops=1)
   ->  Sort  (cost=783138.21..794138.22 rows=4400001 width=12) (actual time=69816.258..80412.546 rows=4400001 loops=1)
     Sort Key: (ROW(color, breed))
     Sort Method: external merge  Disk: 189456kB
     ->  Seq Scan on cats  (cost=0.00..72026.01 rows=4400001 width=12) (actual time=0.013..846.713 rows=4400001 loops=1)
 Total runtime: 81363.373 ms
(6 rows)

Output:

(grey,persian)
(grey,siamese)
(grey,siese)
(white,persian)
(white,siamese)
(5 rows)

Any idea how I can make this faster?

The following approach works, but only for one attribute rather than two: http://zogovic.com/post/44856908222/optimizing-postgresql-query-for-distinct-values

I think I need an index on '(color, breed)' and then:

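create temp table TEMP (color, breed)
insert into TEMP (select (color, breed) from cats where (color, breed) not in TEMP)
until there is nothing more to insert...
select * from TEMP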

But I don't know how to write this in Postgres - should I use RECURSIVE? Or plpgsql?

Thanks!

Answer 1

So, after a lot of work, here is the solution:

First, create an index:

create index ON cats (color,breed);

Next, the simple query for comparison:

cats=# select distinct color,breed from cats;
       row       
-----------------
 (a,b)
 (c,d)
 (grey,persian)
 (grey,siamese)
 (grey,siese)
 (white,persian)
 (white,siamese)
(7 rows)

Time: 853.550 ms

Now the version you will want to use:

WITH RECURSIVE distinct_pairs AS (
    (
        SELECT c as cl FROM cats c where color IS NOT NULL AND breed IS NOT NULL order by c.color,c.breed LIMIT 1
    )
    UNION ALL
    SELECT (
        SELECT c
        FROM cats c
        WHERE
            (c.color,c.breed) > ((p.cl).color,(p.cl).breed)
        ORDER BY c.color,c.breed LIMIT 1
    )
    FROM distinct_pairs p
    WHERE (p.cl).id IS NOT NULL
) SELECT * FROM distinct_pairs p WHERE (p.cl).id IS NOT NULL;
         cl          
---------------------
 (4400007,a,b)
 (5,grey,persian)
 (6,grey,siamese)
 (400006,grey,siese)
 (2,white,persian)
 (4,white,siamese)
(6 rows)

Time: 0.646 ms

About 1300 times faster. Not bad.
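A variant of the same idea, offered as a sketch rather than a measured result: it carries only color and breed through the recursion instead of the whole row, and uses a LATERAL subquery (PostgreSQL 9.3 or later) so that the recursion stops by itself once no larger pair exists, with no NULL marker row to filter out afterwards. It assumes the index on (color, breed) created above; the alias n is only illustrative, and rows with a NULL color are skipped by the row comparison.

```sql
WITH RECURSIVE distinct_pairs AS (
    (
        -- Anchor: the smallest (color, breed) pair, found via the index.
        SELECT color, breed
        FROM cats
        WHERE color IS NOT NULL AND breed IS NOT NULL
        ORDER BY color, breed
        LIMIT 1
    )
    UNION ALL
    -- Step: for the pair found so far, fetch the next strictly larger pair.
    -- When the lateral subquery returns no row, the join produces nothing
    -- and the recursion ends.
    SELECT n.color, n.breed
    FROM distinct_pairs p
    CROSS JOIN LATERAL (
        SELECT c.color, c.breed
        FROM cats c
        WHERE (c.color, c.breed) > (p.color, p.breed)
        ORDER BY c.color, c.breed
        LIMIT 1
    ) n
)
SELECT color, breed FROM distinct_pairs;
```

Each step should be a single index probe, so the total work is expected to scale with the number of distinct pairs rather than with the size of the table.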


Answer 2

Why do you have the parentheses? Do you know what they do, and do you need them?

If I drop them, I get a speedup of about 20x:

select distinct color, breed from cats;

Wrapping the columns into a record, and then unpacking that record for every sort comparison, is a lot of work.
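One way to see the difference on your own data is to compare the plans side by side. This is only a sketch: the plan shapes and timings depend on the PostgreSQL version, settings such as work_mem, and the table statistics, so no particular output is assumed here.

```sql
-- Row-constructor form from the question: DISTINCT over one composite value.
EXPLAIN ANALYZE SELECT DISTINCT (color, breed) FROM cats;

-- Plain column list: DISTINCT over two ordinary text columns.
EXPLAIN ANALYZE SELECT DISTINCT color, breed FROM cats;

-- GROUP BY returns the same set of pairs and may be planned differently.
EXPLAIN ANALYZE SELECT color, breed FROM cats GROUP BY color, breed;
```

In the first plan, look for a sort key of the form ROW(color, breed), as in the plan shown in the question.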
