Expensive query on a 6 GB table

Time: 2018-07-20 14:56:45

Tags: sql postgresql heroku rdms

I have Foo and Buzz tables, like this:

Foos
buzz_id
date

Foo has an index on the foreign key buzz_id, and another index on date.

Buzzes
name
group

Buzz has an index on name, an index on group, and a multicolumn unique index on name and group combined. A Buzz has many Foos.
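For reference, here is a rough sketch of the schema (the column types and the id primary keys are assumptions; the column names and indexes are as listed above):

CREATE TABLE buzzes (
    id      serial PRIMARY KEY,
    name    varchar NOT NULL,
    "group" varchar NOT NULL  -- group is a reserved word, so it needs quoting here
);

CREATE TABLE foos (
    id      serial PRIMARY KEY,
    buzz_id integer REFERENCES buzzes (id),
    date    date
);

CREATE INDEX index_foos_on_buzz_id ON foos (buzz_id);
CREATE INDEX index_foos_on_date    ON foos (date);
CREATE INDEX index_buzzes_on_name  ON buzzes (name);
CREATE INDEX index_buzzes_on_group ON buzzes ("group");
CREATE UNIQUE INDEX index_buzzes_on_name_and_group ON buzzes (name, "group");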

I am running the following query, and it is taking far too long:

SELECT DISTINCT ON (foos.buzz_id) foos.id, foos.date, buzzes.name, buzzes.group FROM foos INNER JOIN buzzes ON buzzes.id = foos.buzz_id 
WHERE (buzzes.group = ANY (ARRAY_OF_GROUPS) 
AND buzzes.name = ANY (ARRAY_OF_NAMES)
AND foos.date <= GIVEN_DATE) ORDER BY foos.buzz_id DESC, foos.date DESC;

I am joining the two tables on the foreign key and trying to get, for each buzz_id, the foo with the latest date, restricted to buzzes whose name and group appear in the given arrays and to foos that satisfy the date condition.

I run the query in two environments, my local machine and the Heroku environment. As you can see, the tables in my local environment are smaller:

Local:
 foos                                        | r          |   4.013e+06 | 639 MB
 foos_pkey                                   | i          | 4.19832e+06 | 198 MB
 index_foos_on_buzz_id                       | i          | 4.19832e+06 | 285 MB
 index_foos_on_date                          | i          | 4.19832e+06 | 330 MB
 buzzes                                      | r          |        2298 | 184 kB
 index_buzzes_on_name_and_group              | i          |        2298 | 120 kB
 index_buzzes_on_group                       | i          |        2298 | 104 kB
 index_buzzes_on_name                        | i          |        2298 | 88 kB

Heroku:
 foos                                        | r          | 4.92772e+07 | 6653 MB 
 foos_pkey                                   | i          | 4.90556e+07 | 3151 MB
 index_foos_on_buzz_id                       | i          | 4.90556e+07 | 2462 MB
 index_foos_on_date                          | i          | 4.90556e+07 | 2421 MB
 buzzes                                      | r          |      328250 | 24 MB
 index_buzzes_on_name_and_group              | i          |      328250 | 10200 kB
 index_buzzes_on_group                       | i          |      328250 | 8624 kB
 index_buzzes_on_name                        | i          |      328250 | 7224 kB

My local tables are missing data, so the query returns fewer rows than in the Heroku environment.

I have a lot of items in ARRAY_OF_NAMES, 500 in this case, and relatively few in ARRAY_OF_GROUPS, say 4.

My Heroku environment does not have enough RAM to hold all of my data in cache, so I know the query there is currently slower than it would be with everything cached.

Running the query with EXPLAIN ANALYZE gives me the following output:

Local:
 Unique  (cost=330087.91..336514.17 rows=1485 width=46) (actual time=3602.511..4131.322 rows=736 loops=1)
   ->  Sort  (cost=330087.91..333301.04 rows=1285252 width=46) (actual time=3602.509..4003.598 rows=1404653 loops=1)
         Sort Key: foos.buzz_id DESC, foos.date DESC
         Sort Method: external merge  Disk: 96096kB
         ->  Hash Join  (cost=311.50..160136.33 rows=1285252 width=46) (actual time=10.815..1438.885 rows=1404653 loops=1)
               Hash Cond: (foos.buzz_id = buzz.id)
               ->  Seq Scan on foos  (cost=0.00..131923.55 rows=4013004 width=32) (actual time=1.728..925.871 rows=4186572 loops=1)
                     Filter: (date <= GIVEN_DATE)
               ->  Hash  (cost=301.48..301.48 rows=801 width=18) (actual time=9.035..9.035 rows=736 loops=1)
                     Buckets: 1024  Batches: 1  Memory Usage: 47kB
                     ->  Index Scan using index_buzzes_on_name on buzzes  (cost=0.28..301.48 rows=801 width=18) (actual time=0.057..8.189 rows=736 loops=1)
                           Index Cond: ((name)::text = ANY (ARRAY_OF_NAMES::text[]))
                           Filter: ((group)::text = ANY (ARRAY_OF_GROUPS::text[]))
                           Rows Removed by Filter: 5
 Planning time: 5.804 ms
 Execution time: 4151.021 ms
(16 rows)

Heroku:

Unique  (cost=1086348.46..1086579.42 rows=17073 width=44) (actual time=64428.256..64980.542 rows=1467 loops=1)
   ->  Sort  (cost=1086348.46..1086463.94 rows=230962 width=44) (actual time=64428.254..64801.540 rows=1889788 loops=1)
         Sort Key: foos.buzz_id DESC, foos.date DESC
         Sort Method: external merge  Disk: 129240kB
         ->  Gather  (cost=3860.54..1082233.34 rows=230962 width=44) (actual time=20.290..61998.689 rows=1889788 loops=1)
               Workers Planned: 1
               Workers Launched: 1
               ->  Hash Join  (cost=2860.54..1058137.14 rows=135860 width=44) (actual time=16.240..61607.831 rows=944894 loops=2)
                     Hash Cond: (foos.buzz_id = buzz.id)
                     ->  Parallel Seq Scan on foos  (cost=0.00..953099.09 rows=28986598 width=32) (actual time=0.312..59310.999 rows=24527783 loops=2)
                           Filter: (date <= GIVEN_DATE)
                     ->  Hash  (cost=2855.15..2855.15 rows=1539 width=16) (actual time=9.022..9.022 rows=1467 loops=2)
                           Buckets: 2048  Batches: 1  Memory Usage: 90kB
                           ->  Bitmap Heap Scan on buzzes  (cost=973.85..2855.15 rows=1539 width=16) (actual time=6.534..8.753 rows=1467 loops=2)
                                 Recheck Cond: (((group)::text = ANY (ARRAY_OF_GROUPS::text[])) AND ((name)::text = ANY (ARRAY_OF_NAMES::text[])))
                                 Heap Blocks: exact=902
                                 ->  BitmapAnd  (cost=973.85..973.85 rows=1539 width=0) (actual time=6.416..6.416 rows=0 loops=2)
                                       ->  Bitmap Index Scan on index_buzzes_on_group  (cost=0.00..87.51 rows=10174 width=0) (actual time=1.059..1.059 rows=10504 loops=2)
                                             Index Cond: ((group)::text = ANY (ARRAY_OF_GROUPS::text[]))
                                       ->  Bitmap Index Scan on index_buzzes_on_name  (cost=0.00..886.14 rows=49668 width=0) (actual time=5.168..5.168 rows=50042 loops=2)
                                             Index Cond: ((name)::text = ANY (ARRAY_OF_NAMES::text[]))
 Planning time: 1.993 ms
 Execution time: 64999.534 ms

Do you have any suggestions for speeding up this query, or is this expected behavior? I figure I must be doing something wrong, because I sincerely doubt I am pushing Postgres to the edge of its performance.

2 Answers:

Answer 0 (score: 0)

If you only need the latest date from foo, an aggregate query is a better fit. It may also be faster, so it is worth a try:

select
  b.id,
  b.name,
  b."group",
  max(f.date)
  from buzzes b,
       foos f
 where b.id = f.buzz_id
   and b."group" = any (ARRAY_OF_GROUPS)
   and b.name = any (ARRAY_OF_NAMES)
   and f.date <= GIVEN_DATE
 group by b.id;  -- name and "group" are functionally dependent on the primary key b.id

About the indexes: if this is the only query that will run against this database, the single-column indexes (name and group) will not help, because your criteria use both columns together. If no other queries join on or filter by just one of these columns, you can drop those two indexes to speed up inserts and updates.
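If you do drop them, that amounts to (using the index names from the question):

DROP INDEX index_buzzes_on_name;
DROP INDEX index_buzzes_on_group;

Note that the unique index on (name, "group") can still serve name-only lookups if you ever need them, since name is its leading column.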

Answer 1 (score: 0)

Also, move the WHERE conditions into the INNER JOIN. That way your data set is reduced as early as possible. Ideally you want to do something like the following, to cut the data down as soon as you can (and definitely avoid a cross join):

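For example, a sketch of that shape, keeping the placeholders from the question (the exact rewrite is an assumption; the idea is to attach the filters to the join itself):

SELECT DISTINCT ON (foos.buzz_id) foos.id, foos.date, buzzes.name, buzzes.group
FROM foos
INNER JOIN buzzes
        ON buzzes.id = foos.buzz_id
       AND buzzes.group = ANY (ARRAY_OF_GROUPS)  -- former WHERE conditions, now part of the join
       AND buzzes.name = ANY (ARRAY_OF_NAMES)
       AND foos.date <= GIVEN_DATE
ORDER BY foos.buzz_id DESC, foos.date DESC;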