Question

I have a table with 36.64 million rows. The table definition is as follows:

id integer, PK
attribute, varchar 255
value, varchar 255
store_id, integer
timestamp, timestamp without timezone
mac_address, varchar 255

In addition, the mac_address and timestamp columns are indexed.

Queries:

select count(*) from table where mac_address = $1 and timestamp between $2 and $3
select * from table where mac_address = $1 and timestamp between $2 and $3

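When I run them in pgAdmin, they take 10 seconds in total. When I run them through JPA, they take more than 40 seconds. There is no EAGER loading.

I have looked at the SimpleJpaRepository code. It issues exactly these two queries, count() and getResultList().

Questions:
1. It looks like neither pgAdmin nor JPA uses the timestamp index. I checked this with ANALYZE and EXPLAIN. But why?
2. Why does JPA take 10 times longer? An ORM adds overhead, but 10 times?
3. How can I improve this?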

Edit 1:

Maybe JPA's count() is not using an index scan and falls back to a sequential scan, which is slow. My PostgreSQL version is 9.5.

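Edit 2: In JPA, setFirstResult() and setMaxResults() are used to fetch 100 entries out of a total of 259242.

I tried to mimic this with LIMIT and OFFSET, but I don't see those keywords in the JPA query. Maybe JPA fetches all the results and then paginates in memory, which in turn causes the performance problem?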
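For comparison, a hand-written paged query might look like the sketch below. The table and column names are taken from the EXPLAIN output in this question. The sort order and the offset are assumptions inferred from the "Index Scan Backward using device_messages_pkey" node, which returned 300 rows under a Limit of 100; they are not confirmed JPA output.

```sql
-- Sketch only: sort order and page offset are assumptions, not captured JPA SQL.
SELECT id, attributecode, attributevalue, mac_address, store_id, "timestamp"
FROM device_messages
WHERE mac_address = '0011E004CA34'
  AND "timestamp" BETWEEN '2018-04-04 00:00:00' AND '2018-05-04 00:00:00'
ORDER BY id DESC        -- assumed; consistent with a backward scan of the primary key
LIMIT 100 OFFSET 200;   -- assumed page: skip 200 rows, return the next 100
```

If the plan for the JPA-issued select contains a Limit node, as the one shown in this question does, the paging is being applied in the database rather than in memory.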

The first execution of the count() query in pgAdmin takes 19 to 55 seconds.

EXPLAIN output for the two queries.

count()

Aggregate  (cost=761166.10..761166.11 rows=1 width=4) (actual time=1273.871..1273.871 rows=1 loops=1)
  Output: count(id)
  Buffers: shared read=92986 written=56
  ->  Bitmap Heap Scan on public.device_messages playerstat0_  (cost=11165.36..760309.47 rows=342650 width=4) (actual time=76.217..1258.389 rows=259242 loops=1)
        Output: id, attributecode, attributevalue, store_id, "timestamp", mac_address
        Recheck Cond: (((playerstat0_.mac_address)::text = '0011E004CA34'::text) AND (playerstat0_."timestamp" >= '2018-04-04 00:00:00'::timestamp without time zone) AND (playerstat0_."timestamp" <= '2018-05-04 00:00:00'::timestamp without time zone))
        Rows Removed by Index Recheck: 6281401
        Heap Blocks: exact=36622 lossy=55083
        Buffers: shared read=92986 written=56
        ->  Bitmap Index Scan on device_messages_mac_address_timestamp_idx  (cost=0.00..11079.70 rows=342650 width=0) (actual time=69.636..69.636 rows=259242 loops=1)
              Index Cond: (((playerstat0_.mac_address)::text = '0011E004CA34'::text) AND (playerstat0_."timestamp" >= '2018-04-04 00:00:00'::timestamp without time zone) AND (playerstat0_."timestamp" <= '2018-05-04 00:00:00'::timestamp without time zone))
              Buffers: shared read=1281
Planning time: 0.138 ms
Execution time: 1274.275 ms

select

Limit  (cost=3362.52..5043.49 rows=100 width=34) (actual time=30.291..42.846 rows=100 loops=1)
  Output: id, attributecode, attributevalue, mac_address, store_id, "timestamp"
  Buffers: shared hit=15447 read=1676
  ->  Index Scan Backward using device_messages_pkey on public.device_messages playerstat0_  (cost=0.57..5759855.56 rows=342650 width=34) (actual time=2.597..42.834 rows=300 loops=1)
        Output: id, attributecode, attributevalue, mac_address, store_id, "timestamp"
        Filter: ((playerstat0_."timestamp" >= '2018-04-04 00:00:00'::timestamp without time zone) AND (playerstat0_."timestamp" <= '2018-05-04 00:00:00'::timestamp without time zone) AND ((playerstat0_.mac_address)::text = '0011E004CA34'::text))
        Rows Removed by Filter: 154833
        Buffers: shared hit=15447 read=1676
Planning time: 0.180 ms
Execution time: 42.878 ms

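Edit 3: After more testing, I confirmed that the cause is count(). The select with LIMIT and OFFSET is quite fast. count() alone can take up to a minute. This is mentioned here: PostgreSQL slow counting

The count estimate function works (it uses ROWS from the query plan), but I cannot call it from JPA.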
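The count estimate function is not shown in the question. A common version, assumed here to be roughly what was used, reads the planner's row estimate out of EXPLAIN output:

```sql
-- Sketch of a planner-based count estimate (assumed, not the asker's exact function).
-- It runs EXPLAIN on the given query text and extracts the first "rows=" value,
-- which is the planner's estimate for the top plan node.
CREATE OR REPLACE FUNCTION count_estimate(query text) RETURNS integer AS $$
DECLARE
  rec      record;
  estimate integer;
BEGIN
  FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
    estimate := substring(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
    EXIT WHEN estimate IS NOT NULL;
  END LOOP;
  RETURN estimate;
END;
$$ LANGUAGE plpgsql VOLATILE STRICT;

-- Usage: pass the filtered select as text (note the doubled single quotes).
SELECT count_estimate(
  'SELECT 1 FROM device_messages
   WHERE mac_address = ''0011E004CA34''
     AND "timestamp" BETWEEN ''2018-04-04'' AND ''2018-05-04'''
);
```

The result is only an estimate: in the plans in this question the planner expected 342650 rows where 259242 were actually found. Because the query is concatenated into a dynamic statement, the text must come from trusted code, not user input. From JPA such a function could in principle be called through a native query (for example EntityManager.createNativeQuery), though that is not tested here.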

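Edit 4: I have partly solved the problem, but not completely.

Regarding the select: after creating an index that matches the query, it actually runs quite fast, 2~5 seconds. But that is without sorting. Sorting adds another processing step to the query.

count() is slow, and the PostgreSQL documentation confirms this. MVCC forces count() to do a heap scan, similar to a sequential scan of the whole table.

One last issue: I am still not sure why the queries on the production server are so much slower than on the test server. They take 60 seconds on production and 5 seconds on the test server, with the same table size and data. The biggest difference is that the production server handles 20 or more inserts per second, while the test server has none. My guess is that the inserts take write locks, so the query is slow because it has to wait for the locks?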
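One way to test the lock hypothesis is to look for lock requests that have not been granted while the slow query is running on production. The sketch below uses only the pg_locks and pg_stat_activity system views, with columns that exist in PostgreSQL 9.5:

```sql
-- Sketch: list sessions currently waiting on a lock, with the statement they run.
-- Run this in a second session while the slow count() is executing.
SELECT a.pid,
       l.locktype,
       l.mode,
       l.granted,
       a.state,
       a.query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE NOT l.granted;
```

Under PostgreSQL's MVCC, plain INSERTs normally do not block SELECTs, so an empty result here would suggest that the slowdown comes from something other than lock waits, possibly I/O or cache contention caused by the write load.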

Answer 1

You should be able to get better performance with a single index that covers both mac_address and timestamp:

CREATE INDEX [CONCURRENTLY] ON table (mac_address, timestamp);

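The timestamp index is not used because its results would have to be cross-referenced with the mac_address index to find the right rows (which actually takes longer than looking the rows up directly).

I have no experience with JPA, so I can't say why it is slower there.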
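As a concrete sketch of that suggestion, using the table name device_messages from the question's EXPLAIN output and a hypothetical index name:

```sql
-- Sketch only: the index name is hypothetical.
-- CONCURRENTLY avoids blocking writes during the build, but it cannot be run
-- inside a transaction block.
CREATE INDEX CONCURRENTLY device_messages_mac_ts_idx
    ON device_messages (mac_address, "timestamp");

-- Refresh planner statistics, then check which index the planner picks.
ANALYZE device_messages;

EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM device_messages
WHERE mac_address = '0011E004CA34'
  AND "timestamp" BETWEEN '2018-04-04 00:00:00' AND '2018-05-04 00:00:00';
```

With the equality column first and the range column second, the planner can satisfy both conditions from one index instead of combining two single-column indexes. Whether it does so should show up as an "Index Cond" that mentions both columns in the plan.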

How to improve JPA / PostgreSQL query performance?
