Question

I'm currently working on a data tracking system. It is a multiprocess application written in Python that works as follows:

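1. Every S seconds it selects the N most appropriate tasks from the database (currently Postgres) and finds data for them.
2. If there are no tasks, it creates N new tasks and returns to (1).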
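The two steps above can be sketched as a simple polling loop. This is only an illustration of the described workflow, not the actual application: `run_tracker`, `select_tasks`, `create_tasks`, `process` and `max_cycles` are hypothetical names, and an in-memory list stands in for the Postgres `task` table.

```python
import time
from typing import Callable, List


def run_tracker(
    select_tasks: Callable[[int], List[str]],  # hypothetical: returns up to n tasks
    create_tasks: Callable[[int], None],       # hypothetical: creates n new tasks
    process: Callable[[str], None],            # hypothetical: finds data for one task
    n: int,
    interval_s: float,
    max_cycles: int,
) -> int:
    """Run the select/create cycle max_cycles times; return tasks processed."""
    processed = 0
    for _ in range(max_cycles):
        tasks = select_tasks(n)          # step 1: pick the N most appropriate tasks
        if tasks:
            for task in tasks:
                process(task)
            processed += len(tasks)
        else:
            create_tasks(n)              # step 2: nothing to do, create N new tasks
        time.sleep(interval_s)           # wait S seconds, then go back to step 1
    return processed


# In-memory stand-in for the task table (assumption for the demo only).
queue: List[str] = []
done: List[str] = []
counter = iter(range(1000))


def select_tasks(n: int) -> List[str]:
    batch = queue[:n]
    del queue[:n]
    return batch


def create_tasks(n: int) -> None:
    queue.extend(f"task-{next(counter)}" for _ in range(n))


processed = run_tracker(select_tasks, create_tasks, done.append,
                        n=3, interval_s=0, max_cycles=4)
print(processed, done)
```

In the real system `select_tasks` would be the task tracking query shown below, which is why its latency bounds how small S can be.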

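The problem is this: I currently have approx. 80 GB of data and 36M tasks, and queries against the task table are getting slower and slower (it is the most populated and most frequently used table).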

The main performance bottleneck is the task tracking query:

LOCK TABLE task IN ACCESS EXCLUSIVE MODE;
SELECT *
FROM task
WHERE line = 1
  AND action = ANY(ARRAY['Find', 'Get'])
  AND (stat IN ('', 'CR1')
       OR stat = 'ERROR' AND (actiondate <= NOW() OR actiondate IS NULL))
ORDER BY taskid, actiondate, action DESC, idtype, date ASC
LIMIT 36;

                                    Table "public.task"
   Column   |            Type             |                    Modifiers
------------+-----------------------------+-------------------------------------------------
 number     | character varying(16)       | not null
 date       | timestamp without time zone | default now()
 stat       | character varying(16)       | not null default ''::character varying
 idtype     | character varying(16)       | not null default 'container'::character varying
 uri        | character varying(1024)     |
 action     | character varying(16)       | not null default 'Find'::character varying
 reason     | character varying(4096)     | not null default ''::character varying
 rev        | integer                     | not null default 0
 actiondate | timestamp without time zone |
 modifydate | timestamp without time zone |
 line       | integer                     |
 datasource | character varying(512)      |
 taskid     | character varying(32)       |
 found      | integer                     | not null default 0
Indexes:
    "task_pkey" PRIMARY KEY, btree (idtype, number)
    "action_index" btree (action)
    "actiondate_index" btree (actiondate)
    "date_index" btree (date)
    "line_index" btree (line)
    "modifydate_index" btree (modifydate)
    "stat_index" btree (stat)
    "taskid_index" btree (taskid)

                               QUERY PLAN                          
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=312638.87..312638.96 rows=36 width=668) (actual time=1838.193..1838.197 rows=36 loops=1)
   ->  Sort  (cost=312638.87..313149.54 rows=204267 width=668) (actual time=1838.192..1838.194 rows=36 loops=1)
         Sort Key: taskid, actiondate, action, idtype, date
         Sort Method: top-N heapsort  Memory: 43kB
         ->  Bitmap Heap Scan on task  (cost=107497.61..306337.31 rows=204267 width=668) (actual time=1013.491..1343.751 rows=914586 loops=1)
               Recheck Cond: ((((stat)::text = ANY ('{"",CR1}'::text[])) OR ((stat)::text = 'ERROR'::text)) AND (line = 1))
               Filter: (((action)::text = ANY ('{Find,Get}'::text[])) AND (((stat)::text = ANY ('{"",CR1}'::text[])) OR (((stat)::text = 'ERROR'::text) AND ((actiondate <= now()) OR (actiondate IS NULL)))))
               Rows Removed by Filter: 133
               Heap Blocks: exact=76064
               ->  BitmapAnd  (cost=107497.61..107497.61 rows=237348 width=0) (actual time=999.457..999.457 rows=0 loops=1)
                     ->  BitmapOr  (cost=9949.15..9949.15 rows=964044 width=0) (actual time=121.936..121.936 rows=0 loops=1)
                           ->  Bitmap Index Scan on stat_index  (cost=0.00..9449.46 rows=925379 width=0) (actual time=117.791..117.791 rows=920900 loops=1)
                                 Index Cond: ((stat)::text = ANY ('{"",CR1}'::text[]))
                           ->  Bitmap Index Scan on stat_index  (cost=0.00..397.55 rows=38665 width=0) (actual time=4.144..4.144 rows=30262 loops=1)
                                 Index Cond: ((stat)::text = 'ERROR'::text)
                     ->  Bitmap Index Scan on line_index  (cost=0.00..97497.14 rows=9519277 width=0) (actual time=853.033..853.033 rows=9605462 loops=1)
                           Index Cond: (line = 1)
 Planning time: 0.284 ms
 Execution time: 1838.882 ms
(19 rows)

Of course, all the fields involved are indexed. I'm currently thinking in two directions:

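1. How to optimize the query, and whether that will actually give me a lasting performance improvement (currently each query takes about 10 seconds, which is unacceptable for dynamic task tracking).
2. Where and how to store the task data more efficiently. Maybe I should use another database for this purpose: Cassandra, VoltDB or some other big data store?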

I think the data should be pre-sorted somehow, so that the actual tasks can be fetched as fast as possible.

Also, please note that my current volume of 80 GB is most likely a minimum rather than a maximum.

Thanks in advance!

Answer 1

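I don't quite understand your use case, but it doesn't look to me like your indexes are working too well. It looks like the query relies mostly on the stat index. I think you need to look into a composite index on something like (action, line, stat).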
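A minimal sketch of the composite-index idea. To keep it self-contained and runnable it uses Python's built-in `sqlite3` module as a stand-in for Postgres, with a cut-down `task` table and a simplified filter (the `actiondate` condition and the `ORDER BY` are omitted). The index name `task_action_line_stat_idx` is hypothetical. The `CREATE INDEX` statement has the same form in Postgres, but the Postgres planner may choose differently, so the real effect has to be checked with `EXPLAIN ANALYZE` on the actual table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Cut-down version of the task table from the question.
    CREATE TABLE task (
        number TEXT NOT NULL,
        idtype TEXT NOT NULL DEFAULT 'container',
        stat   TEXT NOT NULL DEFAULT '',
        action TEXT NOT NULL DEFAULT 'Find',
        line   INTEGER,
        taskid TEXT,
        PRIMARY KEY (idtype, number)
    );
    -- Hypothetical index name; column order follows the suggestion above.
    CREATE INDEX task_action_line_stat_idx ON task (action, line, stat);
""")

# Simplified form of the question's filter (assumption: actiondate check left out).
query = """
    SELECT * FROM task
    WHERE line = 1
      AND action IN ('Find', 'Get')
      AND stat IN ('', 'CR1', 'ERROR')
"""

# Ask SQLite which access path it picks; the last column is the plan detail.
plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
plan_text = " | ".join(row[-1] for row in plan)
print(plan_text)
```

The point of the sketch is that one index covering all three filter columns can replace the separate bitmap scans on `stat_index` and `line_index` seen in the plan; whether (action, line, stat) is the best column order depends on the real data distribution.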

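Another option is to split your data across multiple tables, dividing it on some key with low cardinality. I don't use Postgres, but I don't think looking at another database solution is going to work any better unless you know exactly what you're optimizing for.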
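A rough sketch of what splitting on a low-cardinality key could look like, again using `sqlite3` as a self-contained stand-in for Postgres. It assumes `line` is the splitting key and that it takes only the values in `LINES`; both are guesses, not facts from the question. The helper names (`table_for_line`, `insert_task`, `pending_tasks`) and the `task_line_<n>` table names are hypothetical, and the query is simplified. Postgres 10 and later also offer declarative partitioning (`PARTITION BY LIST`), which does this routing inside the database.

```python
import sqlite3

LINES = (1, 2, 3)  # assumed set of distinct `line` values (hypothetical)


def table_for_line(line: int) -> str:
    """Map a line value to its own table; reject unknown values."""
    if line not in LINES:
        raise ValueError(f"unknown line: {line}")
    return f"task_line_{line}"


conn = sqlite3.connect(":memory:")
for line in LINES:
    # One smaller table per line value instead of one 36M-row table.
    conn.execute(f"""
        CREATE TABLE {table_for_line(line)} (
            number TEXT NOT NULL,
            idtype TEXT NOT NULL DEFAULT 'container',
            stat   TEXT NOT NULL DEFAULT '',
            action TEXT NOT NULL DEFAULT 'Find',
            PRIMARY KEY (idtype, number)
        )
    """)


def insert_task(conn, line, number, stat="", action="Find"):
    conn.execute(
        f"INSERT INTO {table_for_line(line)} (number, stat, action) VALUES (?, ?, ?)",
        (number, stat, action),
    )


def pending_tasks(conn, line, limit=36):
    # The `line = 1` predicate disappears: it is implied by the table name.
    return conn.execute(
        f"""SELECT number, stat, action FROM {table_for_line(line)}
            WHERE action IN ('Find', 'Get') AND stat IN ('', 'CR1', 'ERROR')
            ORDER BY number LIMIT ?""",
        (limit,),
    ).fetchall()


insert_task(conn, 1, "A1")
insert_task(conn, 1, "A2", stat="DONE")
insert_task(conn, 2, "B1")
print(pending_tasks(conn, 1))
```

Each query then touches only the rows for one line, which matters here because the `line = 1` index scan alone matched about 9.6M rows in the plan above.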

Efficient DB solution for system tasks tracking
