我有一个具有以下结构的postgres表:
+---------+-------------+-------------+----------+---------+---------+
| id | timestamp | numvalues | text1 | text2 | text3 |
+---------+-------------+-------------+----------+---------+---------+
|abcd12344| 4124135 |[1,2,53,1241]| apple | banana | papaya |
+---------+-------------+-------------+----------+---------+---------+
id - random alphanumeric value<br>
timestamp - epoch timestamp<br>
numvalues - array of integers<br>
text(n) - text values
该表有大约1.5亿行。
我进行内部查询以获得在数组中的一个值上排序的数据的第n个百分位数。然后,我需要从数组中获取几个值的平均值。数组的长度约为31。
SELECT AVG(COALESCE(numvalues[2], 0))::NUMERIC(10,0), AVG(COALESCE(numvalues[3], 0))::NUMERIC(10,0)...AVG(COALESCE(numvalues[12], 0))::NUMERIC(10,0)
FROM (SELECT timestamp, numvalues, ntile(100)
OVER (ORDER BY numvalues[1]) pval FROM tablename WHERE timestamp >= somevalue and timestamp <= somevalue) innertable
WHERE pval >= x and pval <= y;
这在1.5亿行中返回约700万行,约占该表的5%。但是,它不使用索引。运行EXPLAIN ANALYZE表示它使用Seq Scan,即使enable_seqscan设置为off也是如此。
然而,类似的查询:
SELECT text1, count(distinct(id))
FROM (SELECT timestamp, id, text1, numvalues, ntile(100)
OVER (ORDER BY numvalues[1]) pval FROM tablename WHERE timestamp >= somevalue and timestamp <= somevalue) innertable
WHERE pval >= x and pval <= y GROUP BY text1;
确实使用索引。
索引位于时间戳列
EXPLAIN ANALYZE的结果:
explain analyze select text1, count(distinct(id))
from (select timestamp, text1, numvalues, id, ntile(100) over (order by numvalues[1]) pval from table where timestamp >= 1431100800 and timestamp <= 1431108000 and numvalues[1] NOTNULL) innertable
where pval >= 90 and pval <= 90 group by text1;
QUERY PLAN
--------------------------------------------------------------------------------
GroupAggregate (cost=4554118.06..4554641.66 rows=1 width=28) (actual time=218641.221..219051.984 rows=20 loops=1)
-> Sort (cost=4554118.06..4554292.59 rows=69812 width=28) (actual time=218640.546..218728.294 rows=71441 loops=1)
Sort Key: innertable.text1
Sort Method: quicksort Memory: 8654kB
-> Subquery Scan on innertable (cost=4094722.75..4548501.27 rows=69812 width=28) (actual time=216502.946..218521.666 rows=71441 loops=1)
Filter: ((innertable.pval >= 90) AND (innertable.pval <= 90))
Rows Removed by Filter: 7072674
-> WindowAgg (cost=4094722.75..4339065.03 rows=13962416 width=118) (actual time=202276.333..211374.235 rows=7144115 loops=1)
-> Sort (cost=4094722.75..4129628.79 rows=13962416 width=118) (actual time=164912.487..190272.316 rows=7144115 loops=1)
Sort Key: (table.numvalues[9])
Sort Method: external merge Disk: 1387704kB
-> Index Scan using table_time_idx on table (cost=0.57..1578710.87 rows=13962416 width=118) (actual time=0.124..141014.505 rows=7144115 loops=1)
Index Cond: (("timestamp" >= 1431100800) AND ("timestamp" <= 1431108000))
Filter: (numvalues[1] IS NOT NULL)
Rows Removed by Filter: 7090075
Total runtime: 219340.709 ms
(16 rows)
explain analyze select avg(coalesce(numvalues[9], 0))::NUMERIC(10,0) mean9, avg(coalesce(numvalues[30],0))::NUMERIC(10,0), avg(coalesce(numvalues[8],0))::NUMERIC(10,0) - avg(coalesce(numvalues[30], 0))::NUMERIC(10,0) mean0, avg(coalesce(numvalues[10],0))::NUMERIC(10,0) - avg(coalesce(numvalues[8], 0))::NUMERIC(10,0) mean1, avg(coalesce(numvalues[14],0))::NUMERIC(10,0) - avg(coalesce(numvalues[10], 0))::NUMERIC(10,0) mean2, avg(coalesce(numvalues[13],0))::NUMERIC(10,0) - avg(coalesce(numvalues[14], 0))::NUMERIC(10,0) mean3, avg(coalesce(numvalues[9],0))::NUMERIC(10,0) - avg(coalesce(numvalues[13], 0))::NUMERIC(10,0) mean4
from (select timestamp, id, numvalues, ntile(100) over (order by numvalues[1] ) pval from table where timestamp >= 1431093600.00 and timestamp <= 1431100800.00 and numvalues[9] NOTNULL) innerTable
where pval >= 90.00 and pval <= 90.00 ;
QUERY PLAN
--------------------------------------------------------------------------------
Aggregate (cost=12662077.32..12662077.37 rows=1 width=82) (actual time=650343.769..650343.770 rows=1 loops=1)
-> Subquery Scan on innertable (cost=12634854.54..12661968.84 rows=3615 width=82) (actual time=647745.962..650232.725 rows=71441 loops=1)
Filter: (((innertable.pval)::numeric >= 90.00) AND ((innertable.pval)::numeric <= 90.00))
Rows Removed by Filter: 7072674
-> WindowAgg (cost=12634854.54..12647507.88 rows=723048 width=248) (actual time=632388.293..642338.237 rows=7144115 loops=1)
-> Sort (cost=12634854.54..12636662.16 rows=723048 width=248) (actual time=599893.771..617413.102 rows=7144115 loops=1)
Sort Key: (table.numvalues[9])
Sort Method: external merge Disk: 3214248kB
-> Seq Scan on req_p0swajch2t (cost=0.00..12480460.48 rows=723048 width=248) (actual time=0.041..575423.062 rows=7144115 loops=1)
Filter: ((numvalues[1] IS NOT NULL) AND (("timestamp")::numeric >= 1431100800.00) AND (("timestamp")::numeric <= 1431108000.00))
Rows Removed by Filter: 138191935
Total runtime: 650785.126 ms
(12 rows)
任何人都可以帮我解决为什么Postgres会为一个查询而不是另一个查询使用索引?运行VACUUM ANALYZE也没有帮助。
有没有办法加快查询速度?对整个表格的查询大约需要20-30分钟!分区没有太大区别,因为查询跨越多个分区,并且当跨越的分区数变得更大时,它只显示了几分钟的改进。我有一个具有以下结构的postgres表:
+---------+-------------+-------------+----------+---------+---------+
| id | timestamp | numvalues | text1 | text2 | text3 |
+---------+-------------+-------------+----------+---------+---------+
|abcd12344| 4124135 |[1,2,53,1241]| apple | banana | papaya |
+---------+-------------+-------------+----------+---------+---------+
id - random alphanumeric value<br>
timestamp - epoch timestamp<br>
numvalues - array of integers<br>
text(n) - text values
该表有大约1.5亿行。
我进行内部查询以获得在数组中的一个值上排序的数据的第n个百分位数。然后,我需要从数组中获取几个值的平均值。数组的长度约为31。
SELECT AVG(COALESCE(numvalues[2], 0))::NUMERIC(10,0), AVG(COALESCE(numvalues[3], 0))::NUMERIC(10,0)...AVG(COALESCE(numvalues[12], 0))::NUMERIC(10,0)
FROM (SELECT timestamp, numvalues, ntile(100)
OVER (ORDER BY numvalues[1]) pval FROM tablename WHERE timestamp >= somevalue and timestamp <= somevalue) innertable
WHERE pval >= x and pval <= y;
这在1.5亿行中返回约700万行,约占该表的5%。但是,它不使用索引。运行EXPLAIN ANALYZE表示它使用Seq Scan,即使enable_seqscan设置为off也是如此。
然而,类似的查询:
SELECT text1, count(distinct(id))
FROM (SELECT timestamp, id, text1, numvalues, ntile(100)
OVER (ORDER BY numvalues[1]) pval FROM tablename WHERE timestamp >= somevalue and timestamp <= somevalue) innertable
WHERE pval >= x and pval <= y GROUP BY text1;
确实使用索引。
索引位于时间戳列
EXPLAIN ANALYZE的结果:
explain analyze select text1, count(distinct(id))
from (select timestamp, text1, numvalues, id, ntile(100) over (order by numvalues[1]) pval from table where timestamp >= 1431100800 and timestamp <= 1431108000 and numvalues[1] NOTNULL) innertable
where pval >= 90 and pval <= 90 group by text1;
QUERY PLAN
--------------------------------------------------------------------------------
GroupAggregate (cost=4554118.06..4554641.66 rows=1 width=28) (actual time=218641.221..219051.984 rows=20 loops=1)
-> Sort (cost=4554118.06..4554292.59 rows=69812 width=28) (actual time=218640.546..218728.294 rows=71441 loops=1)
Sort Key: innertable.text1
Sort Method: quicksort Memory: 8654kB
-> Subquery Scan on innertable (cost=4094722.75..4548501.27 rows=69812 width=28) (actual time=216502.946..218521.666 rows=71441 loops=1)
Filter: ((innertable.pval >= 90) AND (innertable.pval <= 90))
Rows Removed by Filter: 7072674
-> WindowAgg (cost=4094722.75..4339065.03 rows=13962416 width=118) (actual time=202276.333..211374.235 rows=7144115 loops=1)
-> Sort (cost=4094722.75..4129628.79 rows=13962416 width=118) (actual time=164912.487..190272.316 rows=7144115 loops=1)
Sort Key: (table.numvalues[9])
Sort Method: external merge Disk: 1387704kB
-> Index Scan using table_time_idx on table (cost=0.57..1578710.87 rows=13962416 width=118) (actual time=0.124..141014.505 rows=7144115 loops=1)
Index Cond: (("timestamp" >= 1431100800) AND ("timestamp" <= 1431108000))
Filter: (numvalues[1] IS NOT NULL)
Rows Removed by Filter: 7090075
Total runtime: 219340.709 ms
(16 rows)
explain analyze select avg(coalesce(numvalues[9], 0))::NUMERIC(10,0) mean9, avg(coalesce(numvalues[30],0))::NUMERIC(10,0), avg(coalesce(numvalues[8],0))::NUMERIC(10,0) - avg(coalesce(numvalues[30], 0))::NUMERIC(10,0) mean0, avg(coalesce(numvalues[10],0))::NUMERIC(10,0) - avg(coalesce(numvalues[8], 0))::NUMERIC(10,0) mean1, avg(coalesce(numvalues[14],0))::NUMERIC(10,0) - avg(coalesce(numvalues[10], 0))::NUMERIC(10,0) mean2, avg(coalesce(numvalues[13],0))::NUMERIC(10,0) - avg(coalesce(numvalues[14], 0))::NUMERIC(10,0) mean3, avg(coalesce(numvalues[9],0))::NUMERIC(10,0) - avg(coalesce(numvalues[13], 0))::NUMERIC(10,0) mean4
from (select timestamp, id, numvalues, ntile(100) over (order by numvalues[1] ) pval from table where timestamp >= 1431093600.00 and timestamp <= 1431100800.00 and numvalues[9] NOTNULL) innerTable
where pval >= 90.00 and pval <= 90.00 ;
QUERY PLAN
--------------------------------------------------------------------------------
Aggregate (cost=12662077.32..12662077.37 rows=1 width=82) (actual time=650343.769..650343.770 rows=1 loops=1)
-> Subquery Scan on innertable (cost=12634854.54..12661968.84 rows=3615 width=82) (actual time=647745.962..650232.725 rows=71441 loops=1)
Filter: (((innertable.pval)::numeric >= 90.00) AND ((innertable.pval)::numeric <= 90.00))
Rows Removed by Filter: 7072674
-> WindowAgg (cost=12634854.54..12647507.88 rows=723048 width=248) (actual time=632388.293..642338.237 rows=7144115 loops=1)
-> Sort (cost=12634854.54..12636662.16 rows=723048 width=248) (actual time=599893.771..617413.102 rows=7144115 loops=1)
Sort Key: (table.numvalues[9])
Sort Method: external merge Disk: 3214248kB
-> Seq Scan on req_p0swajch2t (cost=0.00..12480460.48 rows=723048 width=248) (actual time=0.041..575423.062 rows=7144115 loops=1)
Filter: ((numvalues[1] IS NOT NULL) AND (("timestamp")::numeric >= 1431100800.00) AND (("timestamp")::numeric <= 1431108000.00))
Rows Removed by Filter: 138191935
Total runtime: 650785.126 ms
(12 rows)
任何人都可以帮我解决为什么Postgres会为一个查询而不是另一个查询使用索引?运行VACUUM ANALYZE也没有帮助。
有没有办法加快查询速度?对整个表格的查询大约需要20-30分钟!分区没有太大区别,因为查询跨越多个分区,并且当跨越的分区数量变得更大时,它只显示了几分钟的改进。
答案 0 :(得分:3)
我猜您无法在评论中回复,所以我必须发布回答。
Explain analyze
显示,您的时间戳列与数值timestamp >= 1431093600.00 and timestamp <= 1431100800.00
进行比较,因此将它们转换为数字:
Filter: ((numvalues[1] IS NOT NULL) AND (("timestamp")::numeric >= 1431100800.00) AND (("timestamp")::numeric <= 1431108000.00))
调查发生这种情况的原因并尝试修复它。