We have a simple, generic table structure, implemented in PostgreSQL (8.3; 9.1 is on our horizon). It seems a very straightforward and common setup. It boils down to this:
CREATE TABLE events_event_types
(
    -- this table holds some 50 rows
    id bigserial PRIMARY KEY,
    "name" character varying(255)
);

CREATE TABLE events_events
(
    -- this table holds some 15M rows
    id bigserial PRIMARY KEY,
    datetime timestamp with time zone,
    eventtype_id bigint  -- FK to events_event_types.id
);

CREATE TABLE events_eventdetails
(
    -- this table holds some 65M rows
    id bigserial PRIMARY KEY,
    keyname character varying(255),
    "value" text,
    event_id bigint  -- FK to events_events.id
);
Some rows in the events_events and events_eventdetails tables would look like this:
events_events                 | events_eventdetails
id    datetime  eventtype_id  | id    keyname        value         event_id
------------------------------|---------------------------------------------
100   ...       10            | 1000  transactionId  9774ae16-...  100
                              | 1001  someKey        some value    100
200   ...       20            | 2000  transactionId  9774ae16-...  200
                              | 2001  reductionId    123           200
                              | 2002  reductionId    456           200
300   ...       30            | 3000  transactionId  9774ae16-...  300
                              | 3001  customerId     234           300
                              | 3002  companyId      345           300
We desperately need a "solution" that returns events_events rows 100 AND 200 AND 300 together in a single result set, and fast: whether we ask for reductionId = 123, or for customerId = 234, or for companyId = 345. (AND combinations of these criteria might be of interest too, but that is not really the goal.) Not sure it matters at this point, but the result set should also be filterable on a datetime range and on eventtype_id (an IN list), and should accept a LIMIT.
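For concreteness, here is a minimal sketch (using the table and column names above) of the semantics we are after; it only expresses the intent, it is not itself fast:

-- Sketch only: find the events matching one detail criterion, then return
-- every event that shares a transactionId with any of them.
SELECT e.*
FROM events_events e
JOIN events_eventdetails tx
  ON tx.event_id = e.id AND tx.keyname = 'transactionId'
WHERE tx.value IN (
        SELECT tx2.value
        FROM events_eventdetails d
        JOIN events_eventdetails tx2
          ON tx2.event_id = d.event_id AND tx2.keyname = 'transactionId'
        WHERE d.keyname = 'reductionId' AND d.value = '123'
      )
  AND e.eventtype_id IN (10, 20, 30)            -- optional filter
  AND e.datetime >= now() - interval '7 days'   -- optional range
ORDER BY e.datetime DESC
LIMIT 50;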
I put "solution" in quotes because it could be any of several things.
This is not a new problem: we have spent the last few months trying all three approaches (don't be put off by the queries below), and they have all failed. The solution should return in well under 1 second; previous attempts took up to about 10 seconds.
I would really appreciate some help; I'm at a loss right now...
The two-smaller-queries approach looked much like this:
Query 1:
SELECT Substring(details2_transvalue.VALUE, 0, 32)
FROM events_eventdetails details2_transvalue
JOIN events_eventdetails compdetails ON details2_transvalue.event_id = compdetails.event_id
AND compdetails.keyname = 'companyId'
AND Substring(compdetails.VALUE, 0, 32) = '4'
AND details2_transvalue.keyname = 'transactionId'
Query 2:
SELECT events1.*
FROM events_events events1
JOIN events_eventdetails compDetails ON events1.id = compDetails.event_id
AND compDetails.keyname='companyId'
AND substring(compDetails.value,0,32)='4'
WHERE events1.eventtype_id IN (...)
UNION
SELECT events2.*
FROM events_events events2
JOIN events_eventdetails details2_transKey ON events2.id = details2_transKey.event_id
AND details2_transKey.keyname='transactionId'
AND substring(details2_transKey.value,0,32) IN ( -- result of query 1 goes here -- )
WHERE events2.eventtype_id IN (...)
ORDER BY dateTime DESC LIMIT 50
Performance gets poor because of the large set returned by query 1.
As you can see, values in the events_eventdetails table are always addressed as length-32 substrings, and we have indexed those. There are further indexes on keyname, on event_id, on event_id + keyname, and on keyname + the length-32 substring of value.
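For reference, those indexes would look roughly like this. The names are assumptions, except for events_eventdetails_substring_ind, which shows up in the plans below; note that the substring expression in the index has to match the one used in the queries for the planner to use it:

-- Assumed definitions matching the description above.
CREATE INDEX events_eventdetails_keyname_ind   ON events_eventdetails (keyname);
CREATE INDEX events_eventdetails_event_id_ind  ON events_eventdetails (event_id);
CREATE INDEX events_eventdetails_event_key_ind ON events_eventdetails (event_id, keyname);
CREATE INDEX events_eventdetails_substring_ind ON events_eventdetails (keyname, substring(value, 0, 32));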
Here is a PostgreSQL 9.1 approach, even though that platform is not officially in use here yet:
WITH companyevents AS (
SELECT events1.*
FROM events_events events1
JOIN events_eventdetails compDetails
ON events1.id = compDetails.event_id
AND compDetails.keyname='companyId'
AND substring(compDetails.value,0,32)=' -- my desired companyId -- '
WHERE events1.eventtype_id in (...)
ORDER BY dateTime DESC
LIMIT 50
)
SELECT * from events_events
WHERE transaction_id IN (SELECT transaction_id FROM companyevents)
OR id IN (SELECT id FROM companyevents)
AND eventtype_id IN (...)
ORDER BY dateTime DESC
LIMIT 250;
For a companyId with 28228 transactionIds, the query plan looks like this:
Limit (cost=7545.99..7664.33 rows=250 width=130) (actual time=210.100..3026.267 rows=50 loops=1)
CTE companyevents
-> Limit (cost=7543.62..7543.74 rows=50 width=130) (actual time=206.994..207.020 rows=50 loops=1)
-> Sort (cost=7543.62..7544.69 rows=429 width=130) (actual time=206.993..207.005 rows=50 loops=1)
Sort Key: events1.datetime
Sort Method: top-N heapsort Memory: 23kB
-> Nested Loop (cost=10.02..7529.37 rows=429 width=130) (actual time=0.093..178.719 rows=28228 loops=1)
-> Append (cost=10.02..1140.62 rows=657 width=8) (actual time=0.082..27.594 rows=28228 loops=1)
-> Bitmap Heap Scan on events_eventdetails compdetails (cost=10.02..394.47 rows=97 width=8) (actual time=0.021..0.021 rows=0 loops=1)
Recheck Cond: (((keyname)::text = 'companyId'::text) AND ("substring"(value, 0, 32) = '4'::text))
-> Bitmap Index Scan on events_eventdetails_substring_ind (cost=0.00..10.00 rows=97 width=0) (actual time=0.019..0.019 rows=0 loops=1)
Index Cond: (((keyname)::text = 'companyId'::text) AND ("substring"(value, 0, 32) = '4'::text))
-> Index Scan using events_eventdetails_companyid_substring_ind on events_eventdetails_companyid compdetails (cost=0.00..746.15 rows=560 width=8) (actual time=0.061..18.655 rows=28228 loops=1)
Index Cond: (((keyname)::text = 'companyId'::text) AND ("substring"(value, 0, 32) = '4'::text))
-> Index Scan using events_events_pkey on events_events events1 (cost=0.00..9.71 rows=1 width=130) (actual time=0.004..0.004 rows=1 loops=28228)
Index Cond: (id = compdetails.event_id)
Filter: (eventtype_id = ANY ('{103,106,107,110,45,34,14,87,58,78,7,76,42,11,25,57,98,37,30,35,33,49,52,29,74,28,85,59,51,65,66,18,13,86,75,6,44,38,43,94,56,95,96,71,50,81,90,89,16,17,4,88,79,77,68,97,92,67,72,53,2,10,31,32,80,111,104,93,26,8,61,5,73,70,63,20,60,40,41,23,22,48,36,108,99,64,62,55,69,19,46,47,15,54,100,101,27,21,12,102,105,109,112,113,114,115,116,119,120,121,122,123,124,9,127,24,130,132,129,125,131,118,117,133,134}'::bigint[]))
-> Index Scan Backward using events_events_datetime_ind on events_events (cost=2.25..1337132.75 rows=2824764 width=130) (actual time=210.100..3026.255 rows=50 loops=1)
Filter: ((hashed SubPlan 2) OR ((hashed SubPlan 3) AND (eventtype_id = ANY ('{103,106,107,110,45,34,14,87,58,78,7,76,42,11,25,57,98,37,30,35,33,49,52,29,74,28,85,59,51,65,66,18,13,86,75,6,44,38,43,94,56,95,96,71,50,81,90,89,16,17,4,88,79,77,68,97,92,67,72,53,2,10,31,32,80,111,104,93,26,8,61,5,73,70,63,20,60,40,41,23,22,48,36,108,99,64,62,55,69,19,46,47,15,54,100,101,27,21,12,102,105,109,112,113,114,115,116,119,120,121,122,123,124,9,127,24,130,132,129,125,131,118,117,133,134}'::bigint[]))))
SubPlan 2
-> CTE Scan on companyevents (cost=0.00..1.00 rows=50 width=90) (actual time=206.998..207.071 rows=50 loops=1)
SubPlan 3
-> CTE Scan on companyevents (cost=0.00..1.00 rows=50 width=8) (actual time=0.001..0.026 rows=50 loops=1)
Total runtime: 3026.410 ms
For a companyId with 288 transactionIds, the query plan looks like this:
Limit (cost=7545.99..7664.33 rows=250 width=130) (actual time=30.976..3790.362 rows=54 loops=1)
CTE companyevents
-> Limit (cost=7543.62..7543.74 rows=50 width=130) (actual time=9.263..9.290 rows=50 loops=1)
-> Sort (cost=7543.62..7544.69 rows=429 width=130) (actual time=9.263..9.272 rows=50 loops=1)
Sort Key: events1.datetime
Sort Method: top-N heapsort Memory: 24kB
-> Nested Loop (cost=10.02..7529.37 rows=429 width=130) (actual time=0.071..8.195 rows=1025 loops=1)
-> Append (cost=10.02..1140.62 rows=657 width=8) (actual time=0.060..1.348 rows=1025 loops=1)
-> Bitmap Heap Scan on events_eventdetails compdetails (cost=10.02..394.47 rows=97 width=8) (actual time=0.021..0.021 rows=0 loops=1)
Recheck Cond: (((keyname)::text = 'companyId'::text) AND ("substring"(value, 0, 32) = '5'::text))
-> Bitmap Index Scan on events_eventdetails_substring_ind (cost=0.00..10.00 rows=97 width=0) (actual time=0.019..0.019 rows=0 loops=1)
Index Cond: (((keyname)::text = 'companyId'::text) AND ("substring"(value, 0, 32) = '5'::text))
-> Index Scan using events_eventdetails_companyid_substring_ind on events_eventdetails_companyid compdetails (cost=0.00..746.15 rows=560 width=8) (actual time=0.039..1.006 rows=1025 loops=1)
Index Cond: (((keyname)::text = 'companyId'::text) AND ("substring"(value, 0, 32) = '5'::text))
-> Index Scan using events_events_pkey on events_events events1 (cost=0.00..9.71 rows=1 width=130) (actual time=0.005..0.006 rows=1 loops=1025)
Index Cond: (id = compdetails.event_id)
Filter: (eventtype_id = ANY ('{103,106,107,110,45,34,14,87,58,78,7,76,42,11,25,57,98,37,30,35,33,49,52,29,74,28,85,59,51,65,66,18,13,86,75,6,44,38,43,94,56,95,96,71,50,81,90,89,16,17,4,88,79,77,68,97,92,67,72,53,2,10,31,32,80,111,104,93,26,8,61,5,73,70,63,20,60,40,41,23,22,48,36,108,99,64,62,55,69,19,46,47,15,54,100,101,27,21,12,102,105,109,112,113,114,115,116,119,120,121,122,123,124,9,127,24,130,132,129,125,131,118,117,133,134}'::bigint[]))
-> Index Scan Backward using events_events_datetime_ind on events_events (cost=2.25..1337132.75 rows=2824764 width=130) (actual time=30.975..3790.332 rows=54 loops=1)
Filter: ((hashed SubPlan 2) OR ((hashed SubPlan 3) AND (eventtype_id = ANY ('{103,106,107,110,45,34,14,87,58,78,7,76,42,11,25,57,98,37,30,35,33,49,52,29,74,28,85,59,51,65,66,18,13,86,75,6,44,38,43,94,56,95,96,71,50,81,90,89,16,17,4,88,79,77,68,97,92,67,72,53,2,10,31,32,80,111,104,93,26,8,61,5,73,70,63,20,60,40,41,23,22,48,36,108,99,64,62,55,69,19,46,47,15,54,100,101,27,21,12,102,105,109,112,113,114,115,116,119,120,121,122,123,124,9,127,24,130,132,129,125,131,118,117,133,134}'::bigint[]))))
SubPlan 2
-> CTE Scan on companyevents (cost=0.00..1.00 rows=50 width=90) (actual time=9.266..9.327 rows=50 loops=1)
SubPlan 3
-> CTE Scan on companyevents (cost=0.00..1.00 rows=50 width=8) (actual time=0.001..0.019 rows=50 loops=1)
Total runtime: 3796.736 ms
At 3 s to 4 s this is not bad at all, but it is still a factor of 100+ too slow. Also, this was not on the target hardware. Still, it should show where the pain is.
The following might turn out to be a solution:
Added a table:
CREATE TABLE events_transaction_helper
(
    event_id bigint NOT NULL,
    transactionid character varying(36) NOT NULL,
    keyname character varying(255) NOT NULL,
    value bigint NOT NULL
    -- index on (keyname, value)
);
I populated this table "by hand" for now, but a materialized-view style implementation would do the trick. It would roughly follow this query:
SELECT tr.event_id, tr.value AS transactionid, det.keyname, det.value AS value
FROM events_eventdetails tr
JOIN events_eventdetails det ON det.event_id = tr.event_id
WHERE tr.keyname = 'transactionId'
  AND det.keyname IN ('companyId', 'reductionId', 'customerId');
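Note that native materialized views only arrived in PostgreSQL 9.3, so on 8.3/9.1 the helper table has to be refreshed by hand. A sketch of such a refresh, assuming the table definition above (the ::bigint cast is an assumption, needed because events_eventdetails.value is text while the helper's value column is bigint):

-- Emulated "materialized view" refresh on pre-9.3 PostgreSQL (sketch).
BEGIN;
TRUNCATE events_transaction_helper;
INSERT INTO events_transaction_helper (event_id, transactionid, keyname, value)
SELECT tr.event_id, tr.value AS transactionid, det.keyname, det.value::bigint AS value
FROM events_eventdetails tr
JOIN events_eventdetails det ON det.event_id = tr.event_id
WHERE tr.keyname = 'transactionId'
  AND det.keyname IN ('companyId', 'reductionId', 'customerId');
COMMIT;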
Added a column to the events_events table:
transaction_id character varying(36) null
This new column is filled like this:
update events_events
set transaction_id =
(select value from events_eventdetails
where keyname='transactionId'
and event_id=events_events.id);
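For the query below to perform as shown, both the helper table and the new column need to be indexed. The plan references indexes named testmaterializedviewkeynamevalue and testtransactionid; their definitions are presumably something like:

-- Definitions inferred from the plan below; treat them as assumptions.
CREATE INDEX testmaterializedviewkeynamevalue ON events_transaction_helper (keyname, value);
CREATE INDEX testtransactionid ON events_events (transaction_id);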
Now, the following query consistently returns in < 15 ms:
explain analyze select * from events_events
where transactionId in
(select distinct transactionid
from events_transaction_helper
WHERE keyname='companyId' and value=5)
and eventtype_id in (...)
order by datetime desc limit 250;
Limit (cost=5075.23..5075.85 rows=250 width=130) (actual time=8.901..9.028 rows=250 loops=1)
-> Sort (cost=5075.23..5077.19 rows=785 width=130) (actual time=8.900..8.953 rows=250 loops=1)
Sort Key: events_events.datetime
Sort Method: top-N heapsort Memory: 81kB
-> Nested Loop (cost=57.95..5040.04 rows=785 width=130) (actual time=0.928..8.268 rows=524 loops=1)
-> HashAggregate (cost=52.30..52.42 rows=12 width=37) (actual time=0.895..0.991 rows=276 loops=1)
-> Subquery Scan on "ANY_subquery" (cost=52.03..52.27 rows=12 width=37) (actual time=0.558..0.757 rows=276 loops=1)
-> HashAggregate (cost=52.03..52.15 rows=12 width=37) (actual time=0.556..0.638 rows=276 loops=1)
-> Index Scan using testmaterializedviewkeynamevalue on events_transaction_helper (cost=0.00..51.98 rows=22 width=37) (actual time=0.068..0.404 rows=288 loops=1)
Index Cond: (((keyname)::text = 'companyId'::text) AND (value = 5))
-> Bitmap Heap Scan on events_events (cost=5.65..414.38 rows=100 width=130) (actual time=0.023..0.024 rows=2 loops=276)
Recheck Cond: ((transactionid)::text = ("ANY_subquery".transactionid)::text)
Filter: (eventtype_id = ANY ('{103,106,107,110,45,34,14,87,58,78,7,76,42,11,25,57,98,37,30,35,33,49,52,29,74,28,85,59,51,65,66,18,13,86,75,6,44,38,43,94,56,95,96,71,50,81,90,89,16,17,4,88,79,77,68,97,92,67,72,53,2,10,31,32,80,111,104,93,26,8,61,5,73,70,63,20,60,40,41,23,22,48,36,108,99,64,62,55,69,19,46,47,15,54,100,101,27,21,12,102,105,109,112,113,114,115,116,119,120,121,122,123,124,9,127,24,130,132,129,125,131,118,117,133,134}'::bigint[]))
-> Bitmap Index Scan on testtransactionid (cost=0.00..5.63 rows=100 width=0) (actual time=0.020..0.020 rows=2 loops=276)
Index Cond: ((transactionid)::text = ("ANY_subquery".transactionid)::text)
Total runtime: 9.122 ms
I'll report back later on whether this really turned out to be a workable solution :)
Answer 0 (score: 1)
The idea is not to denormalize, but to normalize further. The events_eventdetails table can be replaced by two tables: one holding the detail types, and one holding the actual values, referencing {event_id, detail_type_id}. This makes the queries easier to execute, since only the numeric ids of the detail types have to be fetched and compared. The gain lies in the reduced number of pages the DBMS has to fetch, because the keynames only need to be stored + retrieved + compared once.
Note: I changed the naming a bit, mostly for reasons of sanity and safety.
SET search_path='cav';
/**** ***/
DROP SCHEMA cav CASCADE;
CREATE SCHEMA cav;
SET search_path='cav';
CREATE TABLE event_types
(
-- this table holds some 50 rows
id bigserial PRIMARY KEY
, zname varchar(255)
);
INSERT INTO event_types(zname)
SELECT 'event_'::text || gs::text
FROM generate_series (1,100) gs
;
CREATE TABLE events
(
-- this table holds some 15M rows
id bigserial PRIMARY KEY
, zdatetime timestamp with time zone
, eventtype_id bigint REFERENCES event_types(id)
);
INSERT INTO events(zdatetime,eventtype_id)
SELECT gs, et.id
FROM generate_series ('2012-04-11 00:00:00'::timestamp
, '2012-04-12 12:00:00'::timestamp ,' 1 hour'::interval ) gs
, event_types et
;
-- SELECT * FROM event_types;
-- SELECT * FROM events;
CREATE TABLE event_details
(
-- this table holds some 65M rows
id bigserial PRIMARY KEY
, event_id bigint REFERENCES events(id)
, keyname varchar(255)
, zvalue text
);
INSERT INTO event_details(event_id, keyname)
SELECT ev.id,im.*
FROM events ev
, (VALUES ('transactionId'::text),('someKey'::text)
,('reductionId'::text),('customerId'::text),('companyId'::text)
) im
;
UPDATE event_details
SET zvalue = 'Some_value'::text || (random() * 1000)::int::text
;
--
-- Domain table with all valid detail_types
--
CREATE TABLE detail_types(
id bigserial PRIMARY KEY
, keyname varchar(255)
);
INSERT INTO detail_types(keyname)
SELECT DISTINCT keyname
FROM event_details
;
--
-- Context-attribute-value table, referencing {event_id, type_id}
--
CREATE TABLE event_detail_values
( event_id BIGINT
, detail_type_id BIGINT
, zvalue text
, PRIMARY KEY(event_id , detail_type_id)
, FOREIGN KEY(event_id ) REFERENCES events(id)
, FOREIGN KEY(detail_type_id)REFERENCES detail_types(id)
);
--
-- For the sake of joining we create some natural keys
--
CREATE INDEX events_details_keyname ON event_details (keyname) ;
CREATE INDEX detail_types_keyname ON detail_types(keyname) ;
INSERT INTO event_detail_values (event_id,detail_type_id, zvalue)
SELECT ed.event_id, dt.id
, ed.zvalue
FROM event_details ed
, detail_types dt
WHERE ed.keyname = dt.keyname
;
--
-- Now we can drop the original table, and use the view instead
--
DROP TABLE event_details;
CREATE VIEW event_details AS (
SELECT dv.event_id AS event_id
, dt.keyname AS keyname
, dv.zvalue AS zvalue
FROM event_detail_values dv
JOIN detail_types dt ON dt.id = dv.detail_type_id
);
EXPLAIN ANALYZE
SELECT ev.id AS event_id
, ev.zdatetime AS zdatetime
, ed.keyname AS keyname
, ed.zvalue AS zevalue
FROM events ev
JOIN event_details ed ON ed.event_id = ev.id
WHERE ed.keyname IN ('transactionId','customerId','companyId')
ORDER BY event_id,keyname
;
The resulting query plan:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=1178.79..1197.29 rows=7400 width=40) (actual time=159.902..177.379 rows=11100 loops=1)
Sort Key: ev.id, dt.keyname
Sort Method: external sort Disk: 560kB
-> Hash Join (cost=108.34..703.22 rows=7400 width=40) (actual time=12.225..122.231 rows=11100 loops=1)
Hash Cond: (dv.event_id = ev.id)
-> Hash Join (cost=1.09..466.47 rows=7400 width=32) (actual time=0.047..74.183 rows=11100 loops=1)
Hash Cond: (dv.detail_type_id = dt.id)
-> Seq Scan on event_detail_values dv (cost=0.00..322.00 rows=18500 width=29) (actual time=0.006..26.543 rows=18500 loops=1)
-> Hash (cost=1.07..1.07 rows=2 width=19) (actual time=0.025..0.025 rows=3 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 1kB
-> Seq Scan on detail_types dt (cost=0.00..1.07 rows=2 width=19) (actual time=0.009..0.014 rows=3 loops=1)
Filter: ((keyname)::text = ANY ('{transactionId,customerId,companyId}'::text[]))
-> Hash (cost=61.00..61.00 rows=3700 width=16) (actual time=12.161..12.161 rows=3700 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 131kB
-> Seq Scan on events ev (cost=0.00..61.00 rows=3700 width=16) (actual time=0.004..5.926 rows=3700 loops=1)
Total runtime: 192.724 ms
(16 rows)
As you can see, the "deepest" part of the query is the retrieval of the detail_type ids given the list of name strings. These are put into a hash table, which is then combined with the corresponding hash on the detail values. (Note: this is pg 9.1.)
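To answer the original kind of lookup with this layout, the key name is first resolved to its small detail_type id, and only the narrow value table is searched afterwards. A sketch against the schema above (the literal value is made up):

SELECT ev.*
FROM events ev
WHERE ev.id IN (
        SELECT dv.event_id
        FROM event_detail_values dv
        JOIN detail_types dt ON dt.id = dv.detail_type_id
        WHERE dt.keyname = 'companyId'
          AND dv.zvalue = 'Some_value345'   -- hypothetical value
      )
ORDER BY ev.zdatetime DESC
LIMIT 50;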
YMMV.
Answer 1 (score: 0)
If you must work with a design along these lines, you should drop the id column from events_eventdetails and declare the primary key as (event_id, keyname). That gives you a highly useful index instead of maintaining a useless index on a synthetic key.
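A sketch of that change (the constraint name is the PostgreSQL default and is an assumption):

ALTER TABLE events_eventdetails DROP CONSTRAINT events_eventdetails_pkey;
ALTER TABLE events_eventdetails DROP COLUMN id;
ALTER TABLE events_eventdetails ADD PRIMARY KEY (event_id, keyname);

Note that the sample data above holds two reductionId rows for event 200, so this key only works if at most one value per keyname per event is guaranteed.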
A step better would be to eliminate the events_eventdetails table entirely and keep that data in an hstore column with a GIN index on it. That might get you to your performance goal without having to pre-define which event details can be stored.
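A minimal sketch of the hstore variant (hstore is a contrib extension; CREATE EXTENSION needs 9.1, on older releases the contrib install script is used instead). Like the natural key above, this assumes at most one value per keyname per event:

CREATE EXTENSION hstore;

ALTER TABLE events_events ADD COLUMN details hstore;

-- One-off migration of the key/value rows into the new column (sketch).
UPDATE events_events e
SET details = (SELECT hstore(array_agg(d.keyname), array_agg(d.value))
               FROM events_eventdetails d
               WHERE d.event_id = e.id);

CREATE INDEX events_events_details_gin ON events_events USING gin (details);

-- The GIN index supports containment lookups such as:
SELECT *
FROM events_events
WHERE details @> hstore('companyId', '345')
ORDER BY datetime DESC
LIMIT 50;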
Better yet, if you can predict or specify the possible event details, would be to not try to implement a database within a database. Make each "keyname" value a column in events_eventdetails, with a data type suited to the nature of that data. This would probably allow the fastest access, at the cost of having to issue ALTER TABLE statements as the nature of the details changes.
Answer 2 (score: 0)
See, if your key (reductionId in this case) occurs in more than roughly 7-10% of all rows in the events_eventdetails table, PostgreSQL will prefer a SeqScan. There is nothing you can do about it; that is simply the fastest way then.
I had a similar case with ISO 8583 packets. Each packet consists of 128 fields (by design), so the first database design followed your approach, with 2 tables: field_id plus a description in one table (events_events in your case), and field_id + field_value in the other (events_eventdetails). Although such a layout follows 3NF, we ran straight into the same problems you are seeing.
In your case you should go for a redesign. One option (the simpler one) is to make events_eventdetails.keyname a smallint, which makes the comparison operations faster. Not a big win, though.
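A sketch of that simpler variant, essentially a small lookup table plus a narrow key column (all names here are hypothetical):

CREATE TABLE events_keynames
(
    id      smallint PRIMARY KEY,
    keyname character varying(255) UNIQUE NOT NULL
);

-- events_eventdetails then references the 2-byte id instead of repeating the varchar:
ALTER TABLE events_eventdetails
    ADD COLUMN keyname_id smallint REFERENCES events_keynames(id);
-- (followed by backfilling keyname_id from keyname and dropping the old keyname column)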
Another option is to collapse the 2 tables into a single one, like this:
CREATE TABLE events_events (
id bigserial,
datetime timestamp with time zone,
eventtype_id bigint,
transactionId text, -- value for transactionId
reductionId text, -- -"- reductionId
companyId text, -- etc.
customerId text,
anyotherId text,
...
);
This will break 3NF, but on the other hand it has its advantages.
Possible drawbacks: unused fields still cost about (number of fields) / 8 bytes per row (the per-row NULL bitmap).
Edit:
I don't quite get what you mean by materializing here.
In your question you mentioned:
"a 'solution' that returns events_events rows 100 AND 200 AND 300 together in a single result set, and FAST, when asked for reductionId = 123, or for customerId = 234, or for companyId = 345."
The redesign suggested here creates a cross-table (pivot) out of events_eventdetails.
To get all events_events rows that satisfy your criteria, you can use:
SELECT *
FROM events_events
WHERE id IN (100, 200, 300)
AND reductionId = 123
-- AND customerId = 234
-- AND companyId = 345;
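For the flattened design from this answer to serve those lookups quickly, each promoted detail column would also need its own index; since most rows will have NULL for any given detail, partial indexes keep them small. A sketch (hypothetical index names):

CREATE INDEX events_events_reductionid_idx
    ON events_events (reductionId) WHERE reductionId IS NOT NULL;
CREATE INDEX events_events_customerid_idx
    ON events_events (customerId) WHERE customerId IS NOT NULL;
CREATE INDEX events_events_companyid_idx
    ON events_events (companyId) WHERE companyId IS NOT NULL;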