The following query runs over roughly 4 million rows. The first two CTEs execute in about an hour. The final statement, however, is on track to take more than 15 years.
WITH parsed AS (
    SELECT name, array(...) description FROM import
), counts AS (
    SELECT unnest(description) token, count(*) FROM parsed GROUP BY 1
)
INSERT INTO table (name, description)
SELECT name, ARRAY(
    SELECT ROW(token, count)::a
    FROM (
        SELECT token, (
            SELECT count
            FROM counts
            WHERE a.token = counts.token
        )
        FROM UNNEST(description) a(token)
    ) _
)::a[] description
FROM parsed;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Insert on table  (cost=55100824.40..162597717038.41 rows=3611956 width=96)
   CTE parsed
     ->  Seq Scan on import  (cost=0.00..51425557.67 rows=3611956 width=787)
           Filter: ((name IS NOT NULL) AND (description IS NOT NULL))
           SubPlan 1
             ->  HashAggregate  (cost=11.59..12.60 rows=101 width=55)
                   ->  Append  (cost=0.00..11.34 rows=101 width=55)
                         ->  Result  (cost=0.00..0.01 rows=1 width=0)
                         ->  Index Scan using import_aliases_mid_idx on import_aliases  (cost=0.00..10.32 rows=100 width=56)
                               Index Cond: (mid = "substring"(import.mid, 5))
           SubPlan 2
             ->  HashAggregate  (cost=0.78..1.30 rows=100 width=0)
                   ->  Result  (cost=0.00..0.53 rows=100 width=0)
   CTE counts
     ->  HashAggregate  (cost=3675165.23..3675266.73 rows=20000 width=32)
           ->  CTE Scan on parsed  (cost=0.00..1869187.23 rows=361195600 width=32)
   ->  CTE Scan on parsed  (cost=0.00..162542616214.01 rows=3611956 width=96)
         SubPlan 6
           ->  Function Scan on unnest a  (cost=0.00..45001.25 rows=100 width=32)
         SubPlan 5
           ->  CTE Scan on counts  (cost=0.00..450.00 rows=100 width=8)
                 Filter: (a.token = token)
parsed and counts each have roughly 4 million rows. The query is currently running, with the final statement inserting about one row every 2 minutes. It barely touches the disk but eats CPU like crazy, and I am puzzled.

What is wrong with the query?

The final statement is supposed to look up each element of description in counts, turning an array like [a,b,c] into something like [(a,9),(b,4),(c,0)], and insert the result.
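One way to avoid re-scanning the counts CTE once per token is to unnest each array a single time, join against counts, and re-aggregate. This is only a sketch, not the query from the question: it assumes the same parsed and counts definitions, a composite type a(token, count), PostgreSQL 9.4+ for WITH ORDINALITY, and unique name values; the array(...) elision is carried over from the question as-is.

```sql
WITH parsed AS (
    SELECT name, array(...) description FROM import  -- same parse as the question
), counts AS (
    SELECT unnest(description) AS token, count(*) AS count
    FROM parsed
    GROUP BY 1
), exploded AS (
    -- unnest each description once, keeping the element order
    SELECT p.name, t.token, t.ord
    FROM parsed p, unnest(p.description) WITH ORDINALITY AS t(token, ord)
)
INSERT INTO table (name, description)
SELECT e.name,
       array_agg(ROW(e.token, COALESCE(c.count, 0))::a ORDER BY e.ord)
FROM exploded e
LEFT JOIN counts c USING (token)
GROUP BY e.name;
```

The planner can then hash-join the two multi-million-row sets in one pass instead of running a correlated lookup for every array element.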
Edit: with parsed and counts materialized as tables, and token in counts indexed, this is the plan:
explain INSERT INTO table (name, mid, description)
SELECT name, mid, ARRAY(
    SELECT ROW(token, count)::a
    FROM (
        SELECT token, (SELECT count FROM counts WHERE a.token = counts.token)
        FROM UNNEST(description) a(token)
    ) _
)::a[] description
FROM parsed;
QUERY PLAN
------------------------------------------------------------------------------------------------------
 Insert on table  (cost=0.00..5761751808.75 rows=4002061 width=721)
   ->  Seq Scan on parsed  (cost=0.00..5761751808.75 rows=4002061 width=721)
         SubPlan 2
           ->  Function Scan on unnest a  (cost=0.00..1439.59 rows=100 width=32)
         SubPlan 1
           ->  Index Scan using counts_token_idx on counts  (cost=0.00..14.39 rows=1 width=4)
                 Index Cond: (a.token = token)
which is far more reasonable. The arrays average 57 elements, so I am guessing it was simply the sheer number of lookups against the (apparently rather inefficient) CTE that was killing performance. It is now doing 300 rows per second, and I am happy.
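That guess can be sanity-checked with some back-of-the-envelope arithmetic. This is only a sketch: the 4M row counts and the 57-element average come from the question, and counting "comparisons" this way is a deliberate simplification of what the executor actually does.

```python
import math

rows = 4_000_000         # rows in parsed (from the question)
avg_tokens = 57          # average description length (from the question)
counts_rows = 4_000_000  # rows in counts (from the question)

# One correlated subquery fires per array element.
lookups = rows * avg_tokens

# Against the un-indexed CTE, each lookup scans all of counts.
comparisons_cte = lookups * counts_rows

# Against an indexed table, each lookup is roughly one B-tree
# descent, about log2(counts_rows) comparisons.
comparisons_indexed = lookups * math.log2(counts_rows)

print(f"lookups:      {lookups:.2e}")
print(f"via CTE scan: {comparisons_cte:.2e}")
print(f"via index:    {comparisons_indexed:.2e}")
```

Around 2×10^8 lookups, each scanning 4M rows, is on the order of 10^15 comparisons, which is consistent with a years-long estimate; the indexed version is roughly five orders of magnitude cheaper.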
Answer 0 (score: 2)
As noted in my edit to the question, materializing parsed and counts as tables, with token in counts indexed, is much faster. I had assumed CTE joins were smarter than they actually are.
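The materialization this answer describes can be sketched as follows. This is an assumption-laden sketch, not the poster's exact DDL: the array(...) elision is carried over from the question, and the index name is taken from the second plan.

```sql
-- Materialize the CTE results as real tables so the planner
-- gets statistics and an index instead of an opaque CTE.
CREATE TABLE parsed AS
SELECT name, array(...) description   -- same parse as the question
FROM import;

CREATE TABLE counts AS
SELECT unnest(description) AS token, count(*) AS count
FROM parsed
GROUP BY 1;

CREATE INDEX counts_token_idx ON counts (token);
ANALYZE counts;  -- refresh row estimates for the new table
```

Each per-token lookup then becomes the single-row index scan on counts_token_idx shown in the second plan, rather than a scan of the whole CTE result.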
Answer 1 (score: 1)
So you are unnesting and reassembling 4M arrays, right?

My guess is that you are running into memory exhaustion, so I think you have a couple of options. The first is to move the data between tables in stages to minimize the problem.

Can you tell whether it is CPU-bound or I/O-bound?