Question

I pulled one month of Reddit comments into a Postgres 9.5 database on my local Win7 machine.

It has 76.6 million rows and 9 text columns.

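Once I finished creating it (which took about 2.5 hours), even the simplest queries take forever. select count(id) from comments ran for five minutes before spitting out an answer.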

I created an index on 'id', but that barely sped up the select count(id) query.

This is a plain default install of Postgres.

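I intend to use ts_vectors to do text search on this. But if even a simple row count takes this long, I dread to think how more complex queries will perform.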
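A minimal sketch of that kind of text-search setup, assuming a stored tsvector column with a GIN index. The names body_tsv and comments_body_tsv_idx, the 'english' configuration, and the search terms are placeholders chosen for illustration, not anything that exists in the table above.

```sql
-- Hypothetical column and index names; 'english' is an assumed text search configuration.
ALTER TABLE comments ADD COLUMN body_tsv tsvector;

-- Populate the column once. This rewrites every row, so on tens of
-- millions of rows it is likely to take a long time.
UPDATE comments SET body_tsv = to_tsvector('english', coalesce(body, ''));

-- GIN index so that @@ matches do not have to scan the whole table.
CREATE INDEX comments_body_tsv_idx ON comments USING gin (body_tsv);

-- Example search: comments containing both words (placeholder terms).
SELECT id, subreddit
FROM comments
WHERE body_tsv @@ to_tsquery('english', 'postgres & windows')
LIMIT 20;
```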

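My question is where I should look for settings/configuration to improve this, if that is possible at all. Or is this simply too much data (a 20 GB database) for Postgres to run queries "quickly"?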

-- auto-generated definition
CREATE TABLE comments
(
  id           TEXT NOT NULL
    CONSTRAINT comments_pkey
    PRIMARY KEY,
  body         TEXT,
  author       TEXT,
  subreddit_id TEXT,
  link_id      TEXT,
  parent_id    TEXT,
  score        TEXT,
  subreddit    TEXT
);
COMMENT ON TABLE comments IS 'Reddit Comments';

Here is the explain output for the simplest of tasks:

postgres=# explain (analyze, buffers) select count(id) from comments;
                                                           QUERY PLAN
    --------------------------------------------------------------------------------
     Aggregate  (cost=2368842.45..2368842.46 rows=1 width=8) (actual time=243275.182..243275.183 rows=1 loops=1)
       Buffers: shared hit=996371 read=1273823
       ->  Seq Scan on comments  (cost=0.00..2349112.76 rows=7891876 width=8) (actual time=0.002..236984.770 rows=70609487 loops=1)
             Buffers: shared hit=996371 read=1273823
     Planning time: 0.060 ms
     Execution time: 243275.209 ms
    (6 rows)

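RE: the comments criticizing me for not normalizing the table and for my choice of data types. Since I'm mainly interested in text analysis of the body column, the other columns are just metadata to me. I'm not sure why I would want to make the schema more complex when I don't need to. And a comment's id is a string, so I used a text column as its data type.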

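I also compared the "tuning settings" recommendations against my default settings. The shared_buffers setting (currently 128MB, could be 512MB) reportedly doesn't matter much on Windows. I could also increase effective_cache_size from 4GB to 8GB on this 16GB Win7 machine. But I don't see much else to change.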
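For reference, a sketch of how those two settings could be inspected and changed from psql, using the values mentioned above. ALTER SYSTEM (available since 9.4) writes to postgresql.auto.conf; editing postgresql.conf by hand has the same effect. Whether these values actually help on this machine is an assumption, not a measured result.

```sql
-- Show the values currently in effect.
SHOW shared_buffers;
SHOW effective_cache_size;

-- Candidate values taken from the paragraph above.
ALTER SYSTEM SET shared_buffers = '512MB';      -- takes effect only after a server restart
ALTER SYSTEM SET effective_cache_size = '8GB';  -- a configuration reload is enough

-- Reload the configuration for settings that do not need a restart.
SELECT pg_reload_conf();
```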

==== EDIT: I ran vacuum, analyze comments, and then re-ran the simple select count(*) from comments

postgres=# explain (analyze, buffers) select count(id) from comments;
                                                                          QUERY    PLAN
-------------------------------------------------------------------------------- -----------------------------------------------------------------------------
 Aggregate  (cost=2541076.54..2541076.55 rows=1 width=8) (actual time=326085.763..326085.763 rows=1 loops=1)
   Buffers: shared hit=276729 read=721831
   ->  Index Only Scan using comments_pkey on comments  (cost=0.57..2364549.52 rows=70610808 width=8) (actual time=122.348..321066.327 rows=70609487 loops=1)
         Heap Fetches: 16408753
         Buffers: shared hit=276729 read=721831
 Planning time: 0.402 ms
 Execution time: 326085.788 ms
(7 rows)


postgres=# select relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze FROM pg_stat_all_tables where schemaname = 'public';

 relname  |         last_vacuum          |        last_autovacuum        |         last_analyze          |       last_autoanalyze
----------+------------------------------+-------------------------------+-------------------------------+-------------------------------
 posts    |                              | 2017-10-08 10:10:08.060342+08 |                               | 2017-10-08 11:53:20.025502+08
 comments | 2017-10-08 21:55:10.26989+08 | 2017-10-08 12:01:57.429096+08 | 2017-10-08 22:12:27.708228+08 | 2017-10-08 14:46:58.541407+08
(3 rows)
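
For context on the "Heap Fetches" line in the plan above: an index-only scan has to visit the heap for every page that is not marked all-visible in the visibility map, and VACUUM is what sets those marks. The following is a sketch of the maintenance step plus one way to check visibility-map coverage. The VERBOSE option and the pg_class query are additions for illustration and were not part of the run shown above.

```sql
-- Remove dead tuples, update the visibility map, and refresh planner statistics.
VACUUM (VERBOSE, ANALYZE) comments;

-- relallvisible is the number of pages marked all-visible as of the last
-- VACUUM or ANALYZE. The closer it is to relpages, the fewer heap fetches
-- an index-only scan should need.
SELECT relname, relpages, relallvisible, reltuples
FROM pg_class
WHERE relname = 'comments';
```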

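Update

Regarding the question about the drive:

It seems the D: hard drive is extremely slow! I moved the database to C: and the select count(*) from comments query took 10 seconds instead of 5 minutes. (I don't know what is wrong with that drive. Benchmarking it also takes forever.)

Results of a few queries: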

SELECT pg_size_pretty(pg_relation_size('comments_pkey'));  -- 1516 MB
SELECT pg_size_pretty(pg_relation_size('comments'));       -- 17 GB

-- avg=7, stddev=0, max=7, min=7
SELECT avg(octet_length(id)), stddev(octet_length(id)), max(octet_length(id)), min(octet_length(id)) from comments;

Re-running the original query:

postgres=# EXPLAIN (ANALYZE, BUFFERS) select count(id) from comments;

QUERY PLAN
--------------------------------------------------------------------------------
 Aggregate  (cost=2010570.45..2010570.46 rows=1 width=8) (actual time=11200.379..11200.379 rows=1 loops=1)
   Buffers: shared hit=20715 read=193157
   ->  Index Only Scan using comments_pkey on comments  (cost=0.57..1834253.61 rows=70526736 width=8) (actual time=0.035..7085.082 rows=70609487 loops=1)
         Heap Fetches: 0
         Buffers: shared hit=20715 read=193157
 Planning time: 0.062 ms
 Execution time: 11200.404 ms
(7 rows)

Large local postgres database - a fool's errand?
