Question

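I have two tables, a and b. Table a contains about 600,000 rows and 6 text columns; table b contains about 30,000 rows and 6 text columns. I am trying to do this: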

create table c as
select *
from a, b
where a.file_name between b.starting_file_name and b.ending_file_name;

I have indexed file_name on a, and indexed starting_file_name and ending_file_name separately on b. Surprisingly, the query takes more than an hour on my HP Proliant ML350p server (64GB RAM).

Here are some other Postgres configuration settings:

shared_buffers = 16GB
work_mem = 1GB
maintenance_work_mem = 1GB
effective_cache_size = 32GB

The explain output:

Nested Loop  (cost=0.00..261971798.23 rows=2685032391 width=250)
  Join Filter: (a.file_name >= b.starting_file_name)
  ->  Seq Scan on a  (cost=0.00..21144.88 rows=618988 width=162)
  ->  Index Scan using b_ending_file_name_idx on b  (cost=0.00..228.00 rows=13013 width=88)
        Index Cond: (a.file_name <= b.end_file_name)

I also tried:

create table c as
select *
from a, b
where a.file_name >=b.starting_file_name
and a.file_name<= b.ending_file_name;

Here is the explain output:

"Nested Loop  (cost=0.00..261971798.23 rows=2685032391 width=250)"
"  Join Filter: (a.file_name>= b.starting_file_name)"
"  ->  Seq Scan on a  (cost=0.00..21144.88 rows=618988 width=162)"
"  ->  Index Scan using b_ending_file_name_idx on b  (cost=0.00..228.00 rows=13013 width=88)"
"        Index Cond: (a.file_name<= b.end_file_name)"

Any suggestions would be much appreciated.

Answer 1

You might have better luck with a composite index on (b.starting_file_name, b.ending_file_name).
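A minimal sketch of that composite index, assuming the table and column names from the question. The index name b_start_end_idx is made up for illustration. Running ANALYZE afterwards refreshes the planner statistics for b; whether the planner then prefers the new index depends on the data.

```sql
-- Hypothetical index name; the columns are those from the question.
CREATE INDEX b_start_end_idx
    ON b (starting_file_name, ending_file_name);

-- Refresh planner statistics for b after creating the index.
ANALYZE b;
```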

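In addition, if the strings are usually unique within a relatively short number of leading characters, you could create an expression index on a prefix of each string and then recheck against the whole string, e.g.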

-- left() on both columns, so the indexed expressions match those used in the query
CREATE INDEX b_filename_prefixes ON b ( 
  left(starting_file_name, 20),
  left(ending_file_name, 20)
);

Then:

select *
from a, b
where 
  left(a.file_name, 20) between left(b.starting_file_name, 20) and left(b.ending_file_name, 20)
  and a.file_name between b.starting_file_name and b.ending_file_name;

I have tested this on some simple sample data to confirm that the planner recognizes the index as a candidate, and it does.
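That check can be repeated roughly as follows, assuming the b_filename_prefixes index has been created. On very small sample tables the planner may prefer a sequential scan regardless, so this sketch turns off enable_seqscan for the session purely as a diagnostic. It is not a setting to keep for production queries.

```sql
-- Diagnostic only: discourage sequential scans in this session.
SET enable_seqscan = off;

-- EXPLAIN shows the chosen plan without running the query.
EXPLAIN
SELECT *
FROM a, b
WHERE left(a.file_name, 20) BETWEEN left(b.starting_file_name, 20)
                                AND left(b.ending_file_name, 20)
  AND a.file_name BETWEEN b.starting_file_name AND b.ending_file_name;

-- Restore the default planner behaviour.
RESET enable_seqscan;
```

If the index is being considered, the plan should mention b_filename_prefixes in an index scan or bitmap index scan on b.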
