Question

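I am working on a project that processes data in batches and fills a PostgreSQL database (9.6, but I can upgrade). The way it currently works, processing happens in distinct steps, and each step adds data to a table that it owns (two processes rarely write to the same table; if they do, they write to different columns).

The way the data is structured, it tends to get more fine-grained with each step. As a simplified example, I have one table defining the data sources. There are very few of them (tens to hundreds), but each data source generates batches of data samples (batch and sample are separate tables that store metadata). Each batch typically yields about 50k samples. Each of these data points is then processed step by step, and each data sample generates more data points in the next table.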

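This worked fine until we reached 1.5 million rows in the sample table (which is not a lot of data from our point of view). Now filtering for a batch is starting to get slow (about 10 ms for every sample we retrieve). It is becoming a major bottleneck, because fetching the data for one batch takes 5-10 minutes to execute (the read time itself is on the order of milliseconds).

We have b-tree indexes on all the foreign keys involved in these queries.

Since our computations target batches, I normally don't need to query across batches during the computation (which is when query time hurts the most at the moment). For data-analysis reasons, however, ad hoc queries across batches must remain possible.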

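So a very simple solution would be to generate a separate database for each batch and somehow query across those databases when needed. If I had only one batch in each database, filtering for a single batch would obviously be instant and my problem would be solved (for now). However, I would then end up with thousands of databases, and data analysis would be painful.

Within PostgreSQL, is there a way to pretend that I have separate databases for some queries? Ideally, this would happen when I "register" a new batch.

Outside the PostgreSQL world, is there another database I should try for my use case?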

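Edit: DDL / schema

In our current implementation, sample_representation is the table that all processing results depend on. A batch is really defined by a (batch.id, representation.id) tuple. The query I tried and described above is slow (10 ms per sample, which adds up to about 5 minutes for 50k samples):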

SELECT sample_representation.id, sample.sample_pos
FROM sample_representation
JOIN sample ON sample.id = sample_representation.id_sample
WHERE sample_representation.id_representation = 'representation-uuid' AND sample.id_batch = 'batch-uuid'

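We currently have about 1.5 million samples, 2 representations, and 460 batches (49 of which have been processed; the others have no samples associated with them), which means each batch has about 30k samples on average. Some have around 50k.

The schema is below. There is some metadata associated with all of the tables, but I am not querying it in this case. The actual sample data is stored separately on disk, not in the database, in case that matters.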

create table batch
(
    id uuid default uuid_generate_v1mc() not null
        constraint batch_pk
            primary key,
    path text not null
        constraint unique_batch_path
            unique,
    id_data_source uuid
)
;
create table sample
(
    id uuid default uuid_generate_v1mc() not null
        constraint sample_pk
            primary key,
    sample_pos integer,
    id_batch uuid
        constraint batch_fk
            references batch
                on update cascade on delete set null
)
;
create index sample_sample_pos_index
    on sample (sample_pos)
;
create index sample_id_batch_sample_pos_index
    on sample (id_batch, sample_pos)

;
create table representation
(
    id uuid default uuid_generate_v1mc() not null
        constraint representation_pk
            primary key,
    id_data_source uuid
)
;
create table data_source
(
    id uuid default uuid_generate_v1mc() not null
        constraint data_source_pk
            primary key
)
;
alter table batch
    add constraint data_source_fk
        foreign key (id_data_source) references data_source
            on update cascade on delete set null
;
alter table representation
    add constraint data_source_fk
        foreign key (id_data_source) references data_source
            on update cascade on delete set null
;
create table sample_representation
(
    id uuid default uuid_generate_v1mc() not null
        constraint sample_representation_pk
            primary key,
    id_sample uuid
        constraint sample_fk
            references sample
                on update cascade on delete set null,
    id_representation uuid
        constraint representation_fk
            references representation
                on update cascade on delete set null
)
;
create unique index sample_representation_id_sample_id_representation_uindex
    on sample_representation (id_sample, id_representation)
;
create index sample_representation_id_sample_index
    on sample_representation (id_sample)
;
create index sample_representation_id_representation_index
    on sample_representation (id_representation)
;

Answer 1

After fiddling around, I found a solution. But I am still not sure why the original query really takes that much time:

SELECT sample_representation.id, sample.sample_pos
FROM sample_representation
JOIN sample ON sample.id = sample_representation.id_sample
WHERE sample_representation.id_representation = 'representation-uuid' AND sample.id_batch = 'batch-uuid'

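Everything is indexed, but the tables are relatively large, with 1.5 million rows each in sample_representation and sample. My guess is that the tables are joined first and only then filtered by the WHERE clause. But even if the join creates a large intermediate result, it shouldn't take that long?!

Anyway, I tried using CTEs instead of joining the two "massive" tables. The idea is to filter early and join afterwards: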
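One way to check that guess instead of assuming it is to ask PostgreSQL for the plan it actually executes. The following is a minimal sketch, not a diagnosis: the two UUID strings are the placeholders from the question and must be replaced with real values, and the ANALYZE calls are included only on the assumption that planner statistics might be stale after bulk loading.

```sql
-- Refresh planner statistics; stale row estimates are a common cause of poor join plans.
ANALYZE sample;
ANALYZE sample_representation;

-- Execute the query and print the real plan with per-node timings and buffer usage.
EXPLAIN (ANALYZE, BUFFERS)
SELECT sample_representation.id, sample.sample_pos
FROM sample_representation
JOIN sample ON sample.id = sample_representation.id_sample
WHERE sample_representation.id_representation = 'representation-uuid'  -- placeholder
  AND sample.id_batch = 'batch-uuid';                                  -- placeholder
```

In the output, the join node type (nested loop, hash join, or merge join) and any large gap between estimated and actual row counts show where the time goes.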

WITH sel_samplerepresentation AS (
  SELECT *
  FROM sample_representation
  WHERE id_representation = '1437a5da-e4b1-11e7-a254-7fff1955d16a'
), sel_samples AS (
  SELECT *
  FROM sample
  -- id_batch is the column name in the schema posted in the question
  WHERE id_batch = '75c04b9c-e4b9-11e7-a93f-132baa27ac91'
)
SELECT sel_samples.sample_pos, sel_samplerepresentation.id
FROM sel_samplerepresentation
-- join on the sample foreign key, matching the original query
JOIN sel_samples ON sel_samples.id = sel_samplerepresentation.id_sample;

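This query also takes forever. The reason is clear here: sel_samples and sel_samplerepresentation each have 50k records, and the join happens on non-indexed columns of the CTEs.

Since CTEs cannot have indexes, I rewrote them as materialized views, to which I can add indexes: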
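Some background on why the CTE version behaves this way: in PostgreSQL versions before 12 (including 9.6), every CTE is computed separately and acts as an optimization fence, so the planner cannot use the base tables' indexes for the final join. Writing the same filters as subqueries in FROM removes that fence. The sketch below assumes the id_batch column from the question's schema. Note that the planner flattens it into essentially the original two-table join, so it takes the CTE out of the picture rather than guaranteeing a speed-up.

```sql
-- Same "filter early, then join" idea, but with derived tables instead of CTEs.
-- The planner can inline these and use the existing indexes on the base tables.
SELECT s.sample_pos, sr.id
FROM (
  SELECT id, id_sample
  FROM sample_representation
  WHERE id_representation = '1437a5da-e4b1-11e7-a254-7fff1955d16a'
) AS sr
JOIN (
  SELECT id, sample_pos
  FROM sample
  WHERE id_batch = '75c04b9c-e4b9-11e7-a93f-132baa27ac91'
) AS s ON s.id = sr.id_sample;
```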

CREATE MATERIALIZED VIEW sel_samplerepresentation AS (
  SELECT *
  FROM sample_representation
  WHERE id_representation='1437a5da-e4b1-11e7-a254-7fff1955d16a'
  );

CREATE MATERIALIZED VIEW sel_samples AS (
  SELECT *
  FROM sample
  WHERE id_batch = '75c04b9c-e4b9-11e7-a93f-132baa27ac91'
);

CREATE INDEX sel_samplerepresentation_sample_id_index ON sel_samplerepresentation (id_sample);
CREATE INDEX sel_samples_id_index ON sel_samples (id);

SELECT sel_samples.sample_pos, sel_samplerepresentation.id
FROM sel_samplerepresentation
JOIN sel_samples ON sel_samples.id = sel_samplerepresentation.id_sample;

DROP MATERIALIZED VIEW sel_samplerepresentation;
DROP MATERIALIZED VIEW sel_samples;

This is more of a hack than a solution, but executing these queries takes 1 second (down from 8 minutes)!
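The same filter-then-index approach can also be written with temporary tables, which clean themselves up. This is a sketch of a variation, not something that was measured here: it assumes the id_batch column from the question's schema and selects only the columns the final join needs.

```sql
BEGIN;

-- Session-private copies of the filtered rows; dropped automatically at COMMIT.
CREATE TEMP TABLE sel_samplerepresentation ON COMMIT DROP AS
  SELECT id, id_sample
  FROM sample_representation
  WHERE id_representation = '1437a5da-e4b1-11e7-a254-7fff1955d16a';

CREATE TEMP TABLE sel_samples ON COMMIT DROP AS
  SELECT id, sample_pos
  FROM sample
  WHERE id_batch = '75c04b9c-e4b9-11e7-a93f-132baa27ac91';

-- Index the join columns and collect statistics on the new tables.
CREATE INDEX ON sel_samplerepresentation (id_sample);
CREATE INDEX ON sel_samples (id);
ANALYZE sel_samplerepresentation;
ANALYZE sel_samples;

SELECT sel_samples.sample_pos, sel_samplerepresentation.id
FROM sel_samplerepresentation
JOIN sel_samples ON sel_samples.id = sel_samplerepresentation.id_sample;

COMMIT;
```

Because temporary tables are private to the session, concurrent batch jobs can use the same table names without colliding, and no explicit DROP statements are needed.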
