The problem is this:
Here is some sample data:
cats=# select * from cats limit 8;
   id    | color |  breed
---------+-------+---------
 4380929 | grey  | persian
 4380930 | grey  | siese
 4380931 | white | persian
 4380932 | white | siamese
 4380933 | grey  | persian
 4380934 | grey  | siese
 4380935 | white | persian
 4380936 | white | siamese
(8 rows)
Here is how to build the database:
psql postgres postgres -c "CREATE DATABASE cats;"
psql cats postgres -c 'CREATE SEQUENCE cat_id_seq;'
psql cats postgres -c "CREATE TABLE cats (id BIGINT NOT NULL default nextval('cat_id_seq'), color text, breed text);"
bash -c 'for i in `seq 1 1000000` ; do echo -e "white\tpersian\nwhite\tsiamese\ngrey\tpersian\ngrey\tsiese"; done;' > /tmp/cats.sql
psql cats postgres -c "COPY cats (color, breed) FROM '/tmp/cats.sql'"
Here is the query:
psql cats postgres -c "select distinct((color,breed)) from cats;"
This is how long it takes me to run:
 Unique  (cost=783138.21..805138.22 rows=6 width=12) (actual time=69816.259..81338.631 rows=5 loops=1)
   ->  Sort  (cost=783138.21..794138.22 rows=4400001 width=12) (actual time=69816.258..80412.546 rows=4400001 loops=1)
         Sort Key: (ROW(color, breed))
         Sort Method: external merge  Disk: 189456kB
         ->  Seq Scan on cats  (cost=0.00..72026.01 rows=4400001 width=12) (actual time=0.013..846.713 rows=4400001 loops=1)
 Total runtime: 81363.373 ms
(6 rows)
Output:
(grey,persian)
(grey,siamese)
(grey,siese)
(white,persian)
(white,siamese)
(5 rows)
Do you know how I can make this faster?
The following approach works, but only for one attribute, not two: http://zogovic.com/post/44856908222/optimizing-postgresql-query-for-distinct-values
I think I need an index on '(color, breed)' and then:
But I don't know how to write this in Postgres (I don't have much background in it). Should I use RECURSIVE? Or plpgsql?
Thanks!
Answer 0 (score: 1)
So, after a lot of work, here is the solution:
First, create an index:
create index ON cats (color,breed);
Then, as a baseline, the simple query:
cats=# select distinct color,breed from cats;
row
-----------------
(a,b)
(c,d)
(grey,persian)
(grey,siamese)
(grey,siese)
(white,persian)
(white,siamese)
(7 rows)
Time: 853.550 ms
Now the version you want to use:
WITH RECURSIVE distinct_pairs AS (
    (
        SELECT c as cl FROM cats c
        where color IS NOT NULL AND breed IS NOT NULL
        order by c.color,c.breed LIMIT 1
    )
    UNION ALL
    SELECT (
        SELECT c
        FROM cats c
        WHERE
            (c.color,c.breed) > ((p.cl).color,(p.cl).breed)
        ORDER BY c.color,c.breed LIMIT 1
    )
    FROM distinct_pairs p
    WHERE (p.cl).id IS NOT NULL
) SELECT * FROM distinct_pairs p WHERE (p.cl).id IS NOT NULL;
cl
---------------------
(4400007,a,b)
(5,grey,persian)
(6,grey,siamese)
(400006,grey,siese)
(2,white,persian)
(4,white,siamese)
(6 rows)
Time: 0.646 ms
1300 times faster. Not bad.
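To see why the recursive version is so much faster, here is a rough Python analogy, not PostgreSQL code. It assumes a sorted list can stand in for the (color, breed) index, and `distinct_pairs` is a name made up for this sketch. Each loop iteration plays the role of one recursive step: take the current pair, then jump straight to the first entry that is strictly greater, so the work grows with the number of distinct pairs rather than with the number of rows.

```python
from bisect import bisect_right


def distinct_pairs(sorted_rows):
    """Return the distinct tuples of a list that is already sorted.

    The sorted list is a stand-in for a B-tree index on (color, breed).
    """
    result = []
    pos = 0
    while pos < len(sorted_rows):
        pair = sorted_rows[pos]
        result.append(pair)
        # One binary search skips every duplicate of `pair`, similar to
        # "WHERE (color, breed) > (prev_color, prev_breed) ... LIMIT 1".
        pos = bisect_right(sorted_rows, pair, lo=pos)
    return result


# Small stand-in for the cats table: four pairs repeated many times.
rows = sorted(
    [("white", "persian"), ("white", "siamese"),
     ("grey", "persian"), ("grey", "siese")] * 1000
)
pairs = distinct_pairs(rows)
print(pairs)
# [('grey', 'persian'), ('grey', 'siese'), ('white', 'persian'), ('white', 'siamese')]
```

With 4000 rows and 4 distinct pairs the loop body runs only 4 times. That is the same reason the query above finishes in well under a millisecond, while the plain DISTINCT has to read every row.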
Thanks to:
Answer 1 (score: 0)
Why do you have the parentheses? Do you know what they do, and do you need them?
If I drop them, I get about a 20x speedup:
select distinct color, breed from cats;
Wrapping the columns into a record and then unpacking that record for every sort comparison is a lot of work.