Postgres performance problem with partitioned data

Time: 2018-06-25 11:32:20

Tags: postgresql performance schema

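I am redesigning a database schema to improve query performance. In the new design there are 5 tables (3 of them are used in the example below), each partitioned by year and month (the test case has 860 tables in total in 172 partitions). The relevant fields are indexed with appropriate index types and operator classes. The database is filled with mock data that is plausible for what can occur in production. The data is almost never updated; once stored, it is only read.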

  • 10M rows in table measurements
  • 160M rows in table material_data
  • 40M rows in table process_data

Hardware and software configuration:

Windows 10 Professional 64bit
Intel Core i7-4790 CPU
1 TB SATA HDD
16 GB RAM
PostgreSQL 11 beta 1

Postgres configuration (postgresql.conf):

shared_buffers = 512MB
temp_buffers   = 32MB
work_mem       = 32MB
maintenance_work_mem = 1GB

max_worker_processes = 8
max_parallel_workers = 8
max_parallel_workers_per_gather = 2

enable_partition_pruning = on
enable_parallel_append = on
constraint_exclusion = partition
default_statistics_target = 500
effective_cache_size = 12GB
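
To confirm that the running server actually uses the values listed above, the settings can be read back from the pg_settings view. This is only a minimal sketch; the list of names is a subset of the parameters shown in this question.

```sql
-- Read back the effective values and where each one was set.
SELECT name, setting, unit, source
FROM pg_settings
WHERE name IN ('shared_buffers', 'work_mem', 'effective_cache_size',
               'default_statistics_target', 'enable_partition_pruning',
               'max_parallel_workers_per_gather')
ORDER BY name;
```

The source column shows whether a value comes from the configuration file or is still the built-in default.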

Database schema:

table measurements (10M records total):
id serial
guid TEXT NOT NULL (index: btree, text_pattern_ops)
start TIMESTAMP(0) WITHOUT TIME ZONE NOT NULL (index: btree)
stop TIMESTAMP(0) WITHOUT TIME ZONE NOT NULL 
mount_point_id SMALLINT NOT NULL (index: btree)
name TEXT NOT NULL
comment TEXT NOT NULL
PARTITION BY RANGE (start)

table process_data (40M records total):
id serial
mount_point_id SMALLINT NOT NULL (index:btree)
measurement_id INTEGER NOT NULL (index: btree)
measurement_start TIMESTAMP WITHOUT TIME ZONE NOT NULL (index: btree)
item_id SMALLINT NOT NULL (index: btree( item_id, item_value) )
item_value REAL NOT NULL
PARTITION BY RANGE (measurement_start)

table material_data (160M records total):
id serial
mount_point_id SMALLINT NOT NULL (index: btree)
measurement_id INTEGER NOT NULL  (index: btree)
measurement_start TIMESTAMP WITHOUT TIME ZONE NOT NULL (index: btree)
material_index SMALLINT NOT NULL (index: btree)
material_data TEXT NOT NULL (index: btree, text_pattern_ops)
PARTITION BY RANGE (measurement_start)

Table relations:
measurements 1 ---+--- 1..N process_data
                  +--- 1..N material_data
                  +--- 1..N ...

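These are the base tables; the index information is given for clarity. In reality, the indexes are defined on the individual partition tables.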

partition tables (data given for one partition):
partition_2018_06_measurements: 60K records
partition_2018_06_process_data: 240K records
partition_2018_06_material_data: 950K records
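
As an illustration of the layout described above, the measurements table and its June 2018 partition could be declared roughly as follows. This is a sketch reconstructed from the column list, not the actual DDL: the partition bounds (month boundaries) and the absence of index names are assumptions.

```sql
-- Base table, partitioned by the start timestamp.
CREATE TABLE measurements (
    id             serial,
    guid           text NOT NULL,
    start          timestamp(0) without time zone NOT NULL,
    stop           timestamp(0) without time zone NOT NULL,
    mount_point_id smallint NOT NULL,
    name           text NOT NULL,
    comment        text NOT NULL
) PARTITION BY RANGE (start);

-- One monthly partition; the lower bound is inclusive, the upper bound exclusive.
CREATE TABLE partition_2018_06_measurements
    PARTITION OF measurements
    FOR VALUES FROM ('2018-06-01') TO ('2018-07-01');

-- Indexes defined on the partition itself, as described in the question.
CREATE INDEX ON partition_2018_06_measurements (guid text_pattern_ops);
CREATE INDEX ON partition_2018_06_measurements (start);
CREATE INDEX ON partition_2018_06_measurements (mount_point_id);
```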

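The common queries are:

  • select all measurements within a given time interval
  • select all measurements with a specific uuid (column guid) or a part of a uuid
  • select all measurements that have certain process_data items
  • select all measurements that have certain material_data items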
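
The first two query types could look roughly like this. The guid prefix is a made-up value for illustration. Note that a btree index with text_pattern_ops supports left-anchored patterns such as 'abc%', but not a search for a fragment in the middle of the string.

```sql
-- All measurements within a time interval.
SELECT *
FROM measurements
WHERE start >= '2018-06-01 00:00:00'
  AND start <  '2018-07-01 00:00:00';

-- All measurements whose guid starts with a given prefix (hypothetical value).
SELECT *
FROM measurements
WHERE guid LIKE '3f2504e0%';
```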

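I ran some tests with different numbers of measurement records and different statistics targets (statistics targets of 100, 250, 500, 750 and 1000; measurements table sizes from 10K to 10M records). That gives 20 different scenarios in total, and the results of the scenarios are comparable. In these tests, a higher statistics target gave better results.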
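
Besides the global default_statistics_target, the statistics target can also be raised for individual columns. A minimal sketch; the choice of partition and columns here is an assumption, picked because these columns appear in the filter conditions of the test query.

```sql
-- Raise the statistics target only for the columns used in the filters.
ALTER TABLE partition_2018_06_process_data
    ALTER COLUMN item_id SET STATISTICS 1000;
ALTER TABLE partition_2018_06_process_data
    ALTER COLUMN item_value SET STATISTICS 1000;

-- The new target only takes effect after statistics are collected again.
ANALYZE partition_2018_06_process_data;
```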

SQL queries used for testing:

DROP VIEW IF EXISTS view_measurements;
DROP VIEW IF EXISTS view_material;
DROP VIEW IF EXISTS view_process;

CREATE TEMPORARY VIEW view_measurements AS
(
   SELECT * FROM 
      measurements m 
   WHERE
          m.start BETWEEN '2018-06-01 00:00:00' AND '2018-07-01 00:00:00'
      AND m.mount_point_id IN( 1,3,5,7,9,11,13,15,17,19 )
);

CREATE TEMPORARY VIEW view_material AS
(
   SELECT 
      md.measurement_id, 
      md.material_index, 
      md.material_data 
   FROM 
      material_data md 
   WHERE
      -- exclude as many rows as possible
          md.measurement_start BETWEEN '2018-06-01 00:00:00' AND '2018-07-01 00:00:00'
      AND md.mount_point_id IN( 1,3,5,7,9,11,13,15,17,19 )
      AND (md.material_data LIKE 'SHX%' OR md.material_data LIKE 'CU23%')
);

CREATE TEMPORARY VIEW view_process AS
(
   SELECT 
      pd.measurement_id, 
      pd.item_id, 
      pd.item_value 
   FROM 
      process_data pd
   WHERE
      -- exclude as many rows as possible
          pd.measurement_start BETWEEN '2018-06-01 00:00:00' AND '2018-07-01 00:00:00'
      AND pd.mount_point_id IN( 1,3,5,7,9,11,13,15,17,19 )
      AND pd.item_id IN ( 110, 111 )
);

--EXPLAIN ANALYZE VERBOSE
SELECT
   *
FROM
   view_measurements vm
WHERE
(
  (
    EXISTS( SELECT 1 FROM view_material md WHERE vm.id = md.measurement_id AND md.material_data LIKE 'SHX%' )  OR
    EXISTS( SELECT 1 FROM view_material md WHERE vm.id = md.measurement_id AND md.material_data LIKE 'CU23%' )
  )
  AND
  (
    EXISTS( SELECT 1 FROM view_process pd WHERE vm.id = pd.measurement_id AND pd.item_id = 110 AND pd.item_value > 1700 ) AND
    EXISTS( SELECT 1 FROM view_process pd WHERE vm.id = pd.measurement_id AND pd.item_id = 111 AND pd.item_value > 2.2 )
  )
);

The query above selects all measurements from 01.06.2018 to 01.07.2018 for which the measurement row has

- a material item starting with 'SHX' or a material item starting with 'CU23' AND
- a process data item with id 110 and value > 1700 AND
- a process data item with id 111 and value > 2.2

The query returns 18 rows.
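
One detail of the interval is worth checking. BETWEEN includes both endpoints, so the views also match a row stamped exactly 2018-07-01 00:00:00. If the monthly partitions are bounded at month boundaries (an assumption, since the partition DDL is not shown), that timestamp belongs to the July partition, because the upper bound of a range partition is exclusive. A half-open interval stays within the June partition:

```sql
-- Half-open interval: includes 2018-06-01 00:00:00, excludes 2018-07-01 00:00:00.
EXPLAIN
SELECT count(*)
FROM measurements m
WHERE m.start >= '2018-06-01 00:00:00'
  AND m.start <  '2018-07-01 00:00:00';
```

The EXPLAIN output lists the partitions that remain after pruning, so it shows directly whether one or two monthly partitions are scanned.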

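On an unprepared (cold-cache) database, the query above sometimes takes up to 1 minute. That seems too slow, especially since all the data comes from exactly 3 tables (the interval fits exactly into partition 2018_06). Once the data has been loaded into the database cache, queries with similar parameters return within a few hundred milliseconds. I ran the same query against larger partitions (quarterly instead of monthly), and the initial query took even longer (2 minutes instead of 1). The query plan shows that the planner's row estimates are 200x / 400x smaller than the actual results (items 10 and 11).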
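
The difference between cold and warm runs, as well as the gap between estimated and actual row counts, can be made visible with the BUFFERS option of EXPLAIN. A minimal sketch on one of the sub-conditions of the test query; running it against process_data alone is a simplification for illustration.

```sql
-- "shared hit" counts pages found in shared_buffers, "read" counts pages
-- fetched from outside shared_buffers. Each plan node also reports the
-- estimated rows next to the actual rows.
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT pd.measurement_id
FROM process_data pd
WHERE pd.measurement_start >= '2018-06-01 00:00:00'
  AND pd.measurement_start <  '2018-07-01 00:00:00'
  AND pd.item_id = 110
  AND pd.item_value > 1700;
```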

I tried using CTEs instead of views, but the timings were worse.
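
For reference, a CTE variant of the process_data part might look like the sketch below; the exact form that was tested is not shown, so the name cte_process and the structure are assumptions. In PostgreSQL 11 a WITH query is always evaluated as a separate step, and conditions from the outer query (such as item_id = 110 or the correlation on measurement_id) are not pushed down into it, which is one plausible reason for the worse timings.

```sql
WITH cte_process AS (
    SELECT pd.measurement_id, pd.item_id, pd.item_value
    FROM process_data pd
    WHERE pd.measurement_start BETWEEN '2018-06-01 00:00:00' AND '2018-07-01 00:00:00'
      AND pd.mount_point_id IN (1,3,5,7,9,11,13,15,17,19)
      AND pd.item_id IN (110, 111)
)
SELECT m.*
FROM measurements m
WHERE m.start BETWEEN '2018-06-01 00:00:00' AND '2018-07-01 00:00:00'
  AND EXISTS (SELECT 1 FROM cte_process p
              WHERE p.measurement_id = m.id
                AND p.item_id = 110 AND p.item_value > 1700)
  AND EXISTS (SELECT 1 FROM cte_process p
              WHERE p.measurement_id = m.id
                AND p.item_id = 111 AND p.item_value > 2.2);
```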

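  • Is it possible to speed up queries on uncached data?
  • Are there major flaws in the design that need to be fixed?
  • Is there a better schema design? With the schema above, the views can be created without joining data from another table, which should be noticeably faster.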

Thanks in advance, Guido

0 answers:

No answers