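My goal is to write a query that returns the number of unique customers who made a purchase within a 365-day window. I wrote the query below in Postgres, and it turned out to be very slow. My table has 812,024 rows containing just an order date and a customer ID. When I remove the distinct, the query returns in about 60 seconds; with it, I have yet to see it finish. I created an index on (order_date, id). I am a complete newcomer to SQL and this is the first thing I have done with it. After a whole day of searching I could not find anything I could get to work, even though I have seen plenty of posts about slow distinct performance.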
SELECT
(d1.Ordered) AS Ordered,
COUNT(distinct d2.ID) Users
FROM
(
SELECT order_date AS Ordered
FROM orders
GROUP BY order_date
) d1
INNER JOIN
(
SELECT order_date AS Ordered, id
FROM orders
) d2
ON d2.Ordered BETWEEN d1.Ordered - 364 AND d1.Ordered
GROUP BY d1.Ordered
ORDER BY d1.Ordered
"Sort (cost=3541596.30..3541596.80 rows=200 width=29)"
" Sort Key: orders_1.order_date"
" -> HashAggregate (cost=3541586.66..3541588.66 rows=200 width=29)"
" -> Nested Loop (cost=16121.73..3040838.52 rows=100149627 width=29)"
" -> HashAggregate (cost=16121.30..16132.40 rows=1110 width=4)"
" -> Seq Scan on orders orders_1 (cost=0.00..14091.24 rows=812024 width=4)"
" -> Index Only Scan using x on orders (cost=0.43..1822.70 rows=90225 width=29)"
" Index Cond: ((order_date >= (orders_1.order_date - 364)) AND (order_date <= orders_1.order_date))"
Answer 0 (score: 2)
No self join is needed; use generate_series:
select
g.order_date::date as "Ordered",
count(distinct o.id) as "Users"
from
generate_series(
(select min(order_date) from orders),
(select max(order_date) from orders),
'1 day'
) g (order_date)
left join
-- generate_series over dates yields timestamps; cast back to date so "- 364" works
orders o on o.order_date between g.order_date::date - 364 and g.order_date::date
group by 1
order by 1
Answer 1 (score: 1)
You did not show your schema, so some of this is guesswork. Change column names etc. as needed.
SELECT
count(DISTINCT users.user_id)
FROM users
INNER JOIN orders ON (users.user_id = orders.user_id)
WHERE orders.order_date > current_date - INTERVAL '1' YEAR;
or
-- Note: this variant returns one row per user (that user's order count), not a single total
SELECT
count(users.user_id)
FROM users
INNER JOIN orders ON (users.user_id = orders.user_id)
WHERE orders.order_date > current_date - INTERVAL '1' YEAR
GROUP BY users.user_id;
Answer 2 (score: 0)
Assuming an actual date type.
SELECT d.day, count(distinct o.id) AS users_past_year
FROM (
SELECT generate_series(min(order_date), max(order_date), '1 day')::date AS day
FROM orders -- single query
) d
LEFT JOIN ( -- fold duplicates on same day right away
SELECT id, order_date
FROM orders
GROUP BY 1,2
) o ON o.order_date > d.day - interval '1 year' -- exclude
AND o.order_date <= d.day -- include
GROUP BY 1
ORDER BY 1;
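Folding multiple purchases from the same user on the same day only makes sense if that is a common occurrence. Otherwise it is faster to omit that step and simply left-join to the table orders.
It is odd that orders.id would be the ID of the user. It should be named user_id.
If you are not comfortable with generate_series() in the SELECT list (which works just fine), you can replace it with a LATERAL JOIN in Postgres 9.3+.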
FROM (SELECT min(order_date) AS a
, max(order_date) AS z FROM orders) x
, generate_series(x.a, x.z, '1 day') AS d(day)
LEFT JOIN ...
Note that day is of type timestamp in this case. It works the same. You may want to cast.
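The LATERAL fragment above can be completed into a full query roughly as follows. This is only a sketch: it assumes order_date is of type date and reuses the join condition from the main query, and it casts the generated value back to date for output.

```sql
-- Sketch: LATERAL variant of the main query, assuming orders(order_date date, id)
SELECT d.day::date AS day                      -- cast the generated timestamp back to date
     , count(DISTINCT o.id) AS users_past_year
FROM  (SELECT min(order_date) AS a
            , max(order_date) AS z FROM orders) x
CROSS  JOIN LATERAL generate_series(x.a, x.z, interval '1 day') AS d(day)
LEFT   JOIN orders o ON o.order_date >  d.day - interval '1 year'  -- exclude
                    AND o.order_date <= d.day                      -- include
GROUP  BY 1
ORDER  BY 1;
```

The cast only affects the output column; the join still compares order_date against the generated value directly.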
I understand this is a read-only table for a single user. That simplifies things. You also seem to have an index already:
CREATE INDEX orders_mult_idx ON orders (order_date, id);
That's good.
Some things to try:
Of course, the usual performance advice applies:
https://wiki.postgresql.org/wiki/Slow_Query_Questions
https://wiki.postgresql.org/wiki/Performance_Optimization
Cluster your table once using this index:
CLUSTER orders USING orders_mult_idx;
That should help a bit. It also effectively runs VACUUM FULL on the table, which removes any dead rows and compacts the table, if applicable.
ALTER TABLE orders ALTER COLUMN number SET STATISTICS 1000;
ANALYZE orders;
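Explanation here:
Make sure you have ample resources allocated, in particular for shared_buffers and work_mem. You can do this temporarily for your session (work_mem can be set per session; changing shared_buffers requires a server restart).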
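A minimal sketch of raising work_mem for one session only. The value 256MB is an illustrative assumption, not a recommendation; size it to the RAM actually available.

```sql
-- Value is an assumption for illustration; adjust to your machine
SET work_mem = '256MB';   -- affects the current session only
SHOW work_mem;            -- confirm the active value
-- ... run the query under test ...
RESET work_mem;           -- return to the server default
SHOW shared_buffers;      -- can only be inspected here; changing it needs a server restart
```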
Try disabling nested loops (enable_nestloop), in your session only. Maybe hash joins are faster. (I would be surprised, though.)
SET enable_nestloop = off;
-- test ...
RESET enable_nestloop;
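Since this seems to be a "temporary table" by nature, you could try making it an actual temporary table that is kept in RAM only. You need enough RAM to allocate enough temp_buffers. Detailed instructions:
Be sure to run ANALYZE manually. Temporary tables are not covered by autovacuum.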
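The temporary-table idea might look roughly like this. The names orders_tmp and orders_tmp_idx and the 500MB figure are hypothetical and chosen only for illustration; temp_buffers has to be set before the session first touches a temporary table.

```sql
-- Sketch only: table name, index name and buffer size are assumptions
SET temp_buffers = '500MB';        -- must come before the first use of temp tables in this session

CREATE TEMP TABLE orders_tmp AS
SELECT order_date, id
FROM   orders
ORDER  BY order_date, id;          -- store rows physically sorted, similar in effect to CLUSTER

CREATE INDEX orders_tmp_idx ON orders_tmp (order_date, id);

ANALYZE orders_tmp;                -- autovacuum does not analyze temporary tables
```

The query would then be run against orders_tmp instead of orders within the same session.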