I have a log table purchase_history that records customers' purchase history, and I would like to get the most recent purchase info of each product for a given customer_id, ordered by date_purchased.
The table has millions of records, and the solution I have is very slow (20+ seconds) for some customer_id values that account for most of the records in the table (25% of the records for a certain customer_id, for example), while for other customer_id values with only a few rows it is very fast (1 second).
Table definition:
create table purchase_history (
id int PRIMARY KEY,
product_name VARCHAR(100),
date_purchased date,
customer_id int
);
Some dummy data:
INSERT into purchase_history VALUES (
1, 'A', '2017-10-10', 123)
, (2, 'A', '2017-10-11', 123)
, (3, 'B', '2017-10-12', 123)
, (4, 'C', '2017-10-09', 123)
, (5, 'B', '2017-11-10', 123);
I have a multicolumn index on (customer_id, product_name, date_purchased).
The result I intend to get:
5,B,2017-11-10
2,A,2017-10-11
4,C,2017-10-09
The solution I have so far:
SELECT *
FROM (
SELECT DISTINCT ON (product_name) *
FROM purchase_history
WHERE customer_id = 123
ORDER BY product_name, date_purchased DESC
) t
ORDER BY date_purchased DESC;
I wonder if there is a better or faster solution?
Update 01/14/2018
Thanks for the comments and answers, and sorry for the confusion. I would like to add some more details:
All columns are not null, including date_purchased.
The index I have matches the ordering (date_purchased DESC):
create index purchase_history_idx on purchase_history(customer_id, product_name, date_purchased DESC)
Using a product_id that references another table is a good point, but unfortunately product_name does not exist in any other table. It is a name specified by the customer. Say I have a UI for customers to enter what they want to purchase, and what the customers enter is the product_name. So purchase_history effectively keeps track of all customers' "wish lists".
Record counts:
customer_id=123 is our biggest customer, with 8,573,491 records, or 42% of the table.
customer_id=124 is our second biggest customer, with 3,062,464 records, or 15% of the table.
Here is the explain analyze of my original distinct on solution:
Sort (cost=2081285.86..2081607.09 rows=128492 width=106) (actual time=11771.444..12012.732 rows=623680 loops=1)
Sort Key: purchase_history.date_purchased
Sort Method: external merge Disk: 69448kB
-> Unique (cost=0.56..2061628.55 rows=128492 width=106) (actual time=0.021..11043.910 rows=623680 loops=1)
-> Index Scan using purchase_history_idx on purchase_history (cost=0.56..2040413.77 rows=8485910 width=106) (actual time=0.019..8506.109 rows=8573491 loops=1)
Index Cond: (customer_id = 123)
Planning time: 0.098 ms
Execution time: 12133.664 ms
Here is the explain analyze of Erwin's CTE solution:
Sort (cost=125.62..125.87 rows=101 width=532) (actual time=30924.208..31154.908 rows=623680 loops=1)
Sort Key: cte.date_purchased
Sort Method: external merge Disk: 33880kB
CTE cte
-> Recursive Union (cost=0.56..120.23 rows=101 width=39) (actual time=0.022..29772.944 rows=623680 loops=1)
-> Limit (cost=0.56..0.80 rows=1 width=39) (actual time=0.020..0.020 rows=1 loops=1)
-> Index Scan using purchase_history_idx on purchase_history (cost=0.56..2040413.77 rows=8485910 width=39) (actual time=0.019..0.019 rows=1 loops=1)
Index Cond: (customer_id = 123)
-> Nested Loop (cost=0.56..11.74 rows=10 width=39) (actual time=0.046..0.047 rows=1 loops=623680)
-> WorkTable Scan on cte c (cost=0.00..0.20 rows=10 width=516) (actual time=0.000..0.000 rows=1 loops=623680)
-> Limit (cost=0.56..1.13 rows=1 width=39) (actual time=0.045..0.045 rows=1 loops=623680)
-> Index Scan using purchase_history_idx on purchase_history purchase_history_1 (cost=0.56..1616900.83 rows=2828637 width=39) (actual time=0.044..0.044 rows=1 loops=623680)
Index Cond: ((customer_id = 123) AND ((product_name)::text > (c.product_name)::text))
-> CTE Scan on cte (cost=0.00..2.02 rows=101 width=532) (actual time=0.024..30269.107 rows=623680 loops=1)
Planning time: 0.207 ms
Execution time: 31273.462 ms
What surprises me is that my query runs much slower for customer_id=124, which contains fewer records than customer_id=123 (note: the index scan is not used; a bitmap index scan is used instead, and I don't know why):
Sort (cost=1323695.21..1323812.68 rows=46988 width=106) (actual time=85739.561..85778.735 rows=109347 loops=1)
Sort Key: purchase_history.date_purchased
Sort Method: external merge Disk: 14560kB
-> Unique (cost=1301329.65..1316845.56 rows=46988 width=106) (actual time=60443.890..85608.347 rows=109347 loops=1)
-> Sort (cost=1301329.65..1309087.61 rows=3103183 width=106) (actual time=60443.888..84727.062 rows=3062464 loops=1)
" Sort Key: purchase_history.product_name, purchase_history.date_purchased"
Sort Method: external merge Disk: 427240kB
-> Bitmap Heap Scan on purchase_history (cost=203634.23..606098.02 rows=3103183 width=106) (actual time=8340.662..10584.483 rows=3062464 loops=1)
Recheck Cond: (customer_id = 124)
Rows Removed by Index Recheck: 4603902
Heap Blocks: exact=41158 lossy=132301
-> Bitmap Index Scan on purchase_history_idx (cost=0.00..202858.43 rows=3103183 width=0) (actual time=8331.711..8331.711 rows=3062464 loops=1)
Index Cond: (customer_id = 124)
Planning time: 0.102 ms
Execution time: 85872.871 ms
Update 01/15/2018
Here is the explain (analyze, buffers) of the query riskop suggested:
GroupAggregate (cost=0.56..683302.46 rows=128492 width=31) (actual time=0.028..5156.113 rows=623680 loops=1)
Group Key: product_name
Buffers: shared hit=1242675
-> Index Only Scan using purchase_history_idx on purchase_history (cost=0.56..639587.99 rows=8485910 width=31) (actual time=0.022..2673.661 rows=8573491 loops=1)
Index Cond: (customer_id = 123)
Heap Fetches: 0
Buffers: shared hit=1242675
Planning time: 0.079 ms
Execution time: 5272.877 ms
Note that I can't use this query as-is, even though it is faster, for two reasons: the result is not ordered by date_purchased DESC, and it only returns the columns listed in the group by. One way to solve both issues is to use riskop's group by based query as a subquery or CTE, and add the order by and the additional columns as needed.
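Something like this minimal sketch (just to illustrate the idea; not benchmarked, and the CTE name is arbitrary). Note that, unlike distinct on, it could return more than one row per product if the latest date_purchased is tied:
WITH latest AS (
    -- riskop's group by query, used as a CTE
    SELECT product_name, max(date_purchased) AS max_date_purchased
    FROM purchase_history
    WHERE customer_id = 123
    GROUP BY product_name
)
SELECT ph.*  -- join back to pick up the remaining columns
FROM purchase_history ph
JOIN latest l ON  ph.product_name   = l.product_name
              AND ph.date_purchased = l.max_date_purchased
WHERE ph.customer_id = 123
ORDER BY ph.date_purchased DESC;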
Update 01/21/2018
Taking advantage of a "loose index scan" sounds like a good idea, but unfortunately product_name is highly distributed: there are 1,810,440 unique product_name values and 2,565,179 unique (customer_id, product_name) combinations:
select count(distinct product_name) from purchase_history; -- 1810440
select count(distinct (customer_id, product_name)) from purchase_history; -- 2565179
As a result, the query that took 313 ms for riskop takes 33 seconds for me:
Sort (cost=122.42..122.68 rows=101 width=532) (actual time=33509.943..33748.856 rows=623680 loops=1)
Sort Key: cte.date_purchased
Sort Method: external merge Disk: 33880kB
" Buffers: shared hit=3053791 read=69706, temp read=4244 written=8484"
CTE cte
-> Recursive Union (cost=0.56..117.04 rows=101 width=39) (actual time=5.886..32288.212 rows=623680 loops=1)
Buffers: shared hit=3053788 read=69706
-> Limit (cost=0.56..0.77 rows=1 width=39) (actual time=5.885..5.885 rows=1 loops=1)
Buffers: shared hit=5 read=3
-> Index Scan using purchase_history_idx on purchase_history (cost=0.56..1809076.40 rows=8543899 width=39) (actual time=5.882..5.882 rows=1 loops=1)
Index Cond: (customer_id = 123)
Buffers: shared hit=5 read=3
-> Nested Loop (cost=0.56..11.42 rows=10 width=39) (actual time=0.050..0.051 rows=1 loops=623680)
Buffers: shared hit=3053783 read=69703
-> WorkTable Scan on cte c (cost=0.00..0.20 rows=10 width=516) (actual time=0.000..0.000 rows=1 loops=623680)
-> Limit (cost=0.56..1.10 rows=1 width=39) (actual time=0.049..0.049 rows=1 loops=623680)
Buffers: shared hit=3053783 read=69703
-> Index Scan using purchase_history_idx on purchase_history purchase_history_1 (cost=0.56..1537840.29 rows=2847966 width=39) (actual time=0.048..0.048 rows=1 loops=623680)
Index Cond: ((customer_id = 123) AND ((product_name)::text > (c.product_name)::text))
Buffers: shared hit=3053783 read=69703
-> CTE Scan on cte (cost=0.00..2.02 rows=101 width=532) (actual time=5.889..32826.816 rows=623680 loops=1)
" Buffers: shared hit=3053788 read=69706, temp written=4240"
Planning time: 0.278 ms
Execution time: 33873.798 ms
Note that the sort is done in memory for riskop (Sort Method: quicksort Memory: 853kB), but as an external merge on disk for me (Sort Method: external merge Disk: 33880kB).
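One thing I may also try on my side (my own idea, not from any of the answers) is raising work_mem for the session so that at least the final sort can stay in memory, for example:
SET work_mem = '256MB';  -- illustrative value only, to be sized to the available RAM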
If this is not a problem that a relational database can solve, I would also like to know whether any other non-relational or big-data based solution would work, as long as it meets 2 requirements:
Answer 0 (score: 1)
Try expressing it explicitly:
SELECT *
FROM purchase_history ph
JOIN
(
SELECT product_name, MAX(date_purchased) max_date_purchased
FROM purchase_history
WHERE customer_id = 123
GROUP BY product_name
) t ON ph.product_name = t.product_name and
ph.date_purchased = t.max_date_purchased and
ph.customer_id = 123
ORDER BY ph.date_purchased DESC;
Another solution is to use a window function:
SELECT *
FROM
(
SELECT *,
dense_rank() over (partition by product_name order by date_purchased desc) rn
FROM purchase_history
WHERE customer_id = 123
) t
WHERE t.rn = 1
ORDER BY t.date_purchased DESC;
Test both and see which one is more efficient.
Answer 1 (score: 1)
Postgres can scan indexes backwards very efficiently, but I would still make the index a perfect match:
(customer_id, product_name, date_purchased DESC)
That is a minor optimization. But since date_purchased can be NULL according to your table definition, you probably want ORDER BY product_name, date_purchased DESC NULLS LAST, together with a matching index; that is a major optimization:
CREATE INDEX new_idx ON purchase_history
(customer_id, product_name, date_purchased DESC NULLS LAST);
Related:
DISTINCT ON is very efficient for few rows per (customer_id, product_name), but not so much for many rows, and that is your weak spot.
This recursive CTE should be able to use the matching index perfectly:
WITH RECURSIVE cte AS (
( -- parentheses required
SELECT id, product_name, date_purchased
FROM purchase_history
WHERE customer_id = 123
ORDER BY product_name, date_purchased DESC NULLS LAST
LIMIT 1
)
UNION ALL
SELECT u.*
FROM cte c
, LATERAL (
SELECT id, product_name, date_purchased
FROM purchase_history
WHERE customer_id = 123 -- repeat condition
AND product_name > c.product_name -- lateral reference
ORDER BY product_name, date_purchased DESC NULLS LAST
LIMIT 1
) u
)
TABLE cte
ORDER BY date_purchased DESC NULLS LAST;
dbfiddle here
Related, with detailed explanation:
You could even branch the logic and run the rCTE for customers with many rows, while sticking with DISTINCT ON for customers with only a few rows ...
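A rough sketch of how that branching might look, wrapped in a function (the 50000-row threshold and the function name are placeholders; adapt them to your data):
CREATE OR REPLACE FUNCTION latest_purchases(_customer_id int)
  RETURNS SETOF purchase_history
  LANGUAGE plpgsql AS
$func$
DECLARE
   _rows bigint;
BEGIN
   -- capped count: cheap even for huge customers thanks to the index
   SELECT count(*) INTO _rows
   FROM  (SELECT 1
          FROM   purchase_history
          WHERE  customer_id = _customer_id
          LIMIT  50001) sub;

   IF _rows <= 50000 THEN   -- few rows per customer: DISTINCT ON is good enough
      RETURN QUERY
      SELECT DISTINCT ON (product_name) *
      FROM   purchase_history
      WHERE  customer_id = _customer_id
      ORDER  BY product_name, date_purchased DESC NULLS LAST;
   ELSE                     -- many rows: emulate a loose index scan with the rCTE
      RETURN QUERY
      WITH RECURSIVE cte AS (
         (  -- parentheses required
         SELECT *
         FROM   purchase_history
         WHERE  customer_id = _customer_id
         ORDER  BY product_name, date_purchased DESC NULLS LAST
         LIMIT  1
         )
         UNION ALL
         SELECT u.*
         FROM   cte c
         ,      LATERAL (
            SELECT *
            FROM   purchase_history
            WHERE  customer_id = _customer_id
            AND    product_name > c.product_name
            ORDER  BY product_name, date_purchased DESC NULLS LAST
            LIMIT  1
            ) u
         )
      SELECT * FROM cte;
   END IF;
END
$func$;

-- call it like this, ordering the result as needed:
SELECT * FROM latest_purchases(123) ORDER BY date_purchased DESC NULLS LAST;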
It is worth noting that your table purchase_history has product_name VARCHAR(100). In a perfect world (a normalized schema) this would be product_id int (with an FK reference to a product table). That helps performance in multiple ways: a smaller table and index, and operations on integer are considerably faster than on varchar(100).
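A sketch of what that could look like (table and column names are only illustrative):
CREATE TABLE product (
   product_id   serial PRIMARY KEY
 , product_name varchar(100) NOT NULL UNIQUE  -- the customer-entered name, stored once
);

CREATE TABLE purchase_history (
   id             int  PRIMARY KEY
 , product_id     int  NOT NULL REFERENCES product
 , date_purchased date NOT NULL
 , customer_id    int  NOT NULL
);

CREATE INDEX purchase_history_idx ON purchase_history (customer_id, product_id, date_purchased DESC);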
Related:
Answer 2 (score: 1)
I think the most important question is what the distribution of product_name is in your data.
You mentioned that users fill this in with product names, so I guess you have a couple of thousand different product_name values.
If that is the case, then I think your problem is that Postgresql does not use a "loose index scan" (https://wiki.postgresql.org/wiki/Loose_indexscan), even though the number of distinct values is small compared to the total number of records.
A good article describing a case very similar to yours: http://malisper.me/the-missing-postgres-scan-the-loose-index-scan/
So I tried to reproduce your large data set. The test data created by the procedure below contains 20 million rows. There are 10000 products (product_name is a random value between 0 and 10000). There are 45 different customer_id values: 43% of the rows are "123", 15% are "124", and the remaining 42% are randomly distributed between 59 and 100. date_purchased is a random day between 1092-04-05 and 1913-08-19.
do '
begin
drop table purchase_history;
create table purchase_history (
id int,
product_name VARCHAR(100) not null,
date_purchased date not null,
customer_id int not null
);
FOR i IN 0..20000000 - 1 LOOP
insert into purchase_history values (
i,
(select trunc(random() * 10000)),
to_date('''' || (select trunc(random() * 300000 + 2120000)), ''J''),
(select trunc(random() * 100))
);
end loop;
update purchase_history set customer_id=123 where customer_id < 43;
update purchase_history set customer_id=124 where customer_id < 58;
ALTER TABLE purchase_history ADD PRIMARY KEY (id);
end;
'
The index is the same as in your post:
CREATE INDEX idx ON purchase_history
(customer_id, product_name, date_purchased desc);
Just to make sure that we really have 10000 different product_name values:
SELECT product_name FROM purchase_history GROUP BY product_name;
Now the "reference" query runs in 3200 ms on this data set:
explain (analyze,buffers)
SELECT product_name, max(date_purchased)
FROM purchase_history
WHERE customer_id = 123
GROUP BY product_name
order by max(date_purchased) desc;
Execution plan:
Sort (cost=171598.50..171599.00 rows=200 width=222) (actual time=3219.176..3219.737 rows=10000 loops=1)
Sort Key: (max(date_purchased)) DESC
Sort Method: quicksort Memory: 853kB
Buffers: shared hit=3 read=105201 written=11891
-> HashAggregate (cost=171588.86..171590.86 rows=200 width=222) (actual time=3216.382..3217.361 rows=10000 loops=1)
Group Key: product_name
Buffers: shared hit=3 read=105201 written=11891
-> Bitmap Heap Scan on purchase_history (cost=2319.56..171088.86 rows=100000 width=222) (actual time=766.196..1634.934 rows=8599329 loops=1)
Recheck Cond: (customer_id = 123)
Rows Removed by Index Recheck: 15263
Heap Blocks: exact=45627 lossy=26625
Buffers: shared hit=3 read=105201 written=11891
-> Bitmap Index Scan on idx (cost=0.00..2294.56 rows=100000 width=0) (actual time=759.686..759.686 rows=8599329 loops=1)
Index Cond: (customer_id = 123)
Buffers: shared hit=3 read=32949 written=11859
Planning time: 0.192 ms
Execution time: 3220.096 ms
The optimized query, which is basically the same as Erwin's, uses the index and performs a "loose index scan" with the help of an iterative CTE (misleadingly named "recursive" CTE). It runs in only 310 ms, about 10 times faster:
explain (analyze,buffers)
WITH RECURSIVE cte AS (
( -- parentheses required
SELECT id, product_name, date_purchased
FROM purchase_history
WHERE customer_id = 123
ORDER BY product_name, date_purchased DESC
LIMIT 1
)
UNION ALL
SELECT u.*
FROM cte c
, LATERAL (
SELECT id, product_name, date_purchased
FROM purchase_history
WHERE customer_id = 123 -- repeat condition
AND product_name > c.product_name -- lateral reference
ORDER BY product_name, date_purchased DESC
LIMIT 1
) u
)
TABLE cte
ORDER BY date_purchased DESC NULLS LAST;
Execution plan:
Sort (cost=444.02..444.27 rows=101 width=226) (actual time=312.928..313.585 rows=10000 loops=1)
Sort Key: cte.date_purchased DESC NULLS LAST
Sort Method: quicksort Memory: 853kB
Buffers: shared hit=31432 read=18617 written=14
CTE cte
-> Recursive Union (cost=0.56..438.64 rows=101 width=226) (actual time=0.054..308.678 rows=10000 loops=1)
Buffers: shared hit=31432 read=18617 written=14
-> Limit (cost=0.56..3.79 rows=1 width=226) (actual time=0.052..0.053 rows=1 loops=1)
Buffers: shared hit=4 read=1
-> Index Scan using idx on purchase_history (cost=0.56..322826.56 rows=100000 width=226) (actual time=0.050..0.050 rows=1 loops=1)
Index Cond: (customer_id = 123)
Buffers: shared hit=4 read=1
-> Nested Loop (cost=0.56..43.28 rows=10 width=226) (actual time=0.030..0.030 rows=1 loops=10000)
Buffers: shared hit=31428 read=18616 written=14
-> WorkTable Scan on cte c (cost=0.00..0.20 rows=10 width=218) (actual time=0.000..0.000 rows=1 loops=10000)
-> Limit (cost=0.56..4.29 rows=1 width=226) (actual time=0.030..0.030 rows=1 loops=10000)
Buffers: shared hit=31428 read=18616 written=14
-> Index Scan using idx on purchase_history purchase_history_1 (cost=0.56..124191.22 rows=33333 width=226) (actual time=0.030..0.030 rows=1 loops=10000)
Index Cond: ((customer_id = 123) AND ((product_name)::text > (c.product_name)::text))
Buffers: shared hit=31428 read=18616 written=14
-> CTE Scan on cte (cost=0.00..2.02 rows=101 width=226) (actual time=0.058..310.821 rows=10000 loops=1)
Buffers: shared hit=31432 read=18617 written=14
Planning time: 0.418 ms
Execution time: 313.988 ms
Answer 3 (score: 0)
Can you show us the result of the following simplified query in your environment?
explain (analyze,buffers)
SELECT product_name, max(date_purchased)
FROM purchase_history
WHERE customer_id = 123
GROUP BY product_name;