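I have a cron job that runs a Django management command once a minute. When the command starts, it sets a flag, is_running = true, in a table stored in a PostgreSQL database (with pgbouncer as the connection pooler). Later processes check this flag so that the same task is not started again while one is already running.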
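For illustration, the flag pattern described above can be sketched roughly as follows. This is not the actual management command: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, the function names try_claim and release are hypothetical, and it folds the check and the update into a single conditional UPDATE, which is a variation on the check-then-set described above.

```python
import sqlite3


def try_claim(conn, job_id):
    """Set is_running for job_id only if it is not already set.

    Returns True when this caller flipped the flag, False otherwise.
    """
    cur = conn.execute(
        "UPDATE myapp_job SET is_running = 1 WHERE id = ? AND is_running = 0",
        (job_id,),
    )
    # Commit right away so the transaction does not stay open holding locks.
    conn.commit()
    return cur.rowcount == 1


def release(conn, job_id):
    """Clear the flag once the job has finished."""
    conn.execute("UPDATE myapp_job SET is_running = 0 WHERE id = ?", (job_id,))
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # stand-in database, not PostgreSQL
    conn.execute(
        "CREATE TABLE myapp_job (id INTEGER PRIMARY KEY, is_running INTEGER)"
    )
    conn.execute("INSERT INTO myapp_job VALUES (32, 0)")
    conn.commit()
    print(try_claim(conn, 32))  # first caller gets the flag
    print(try_claim(conn, 32))  # second caller is turned away
    release(conn, 32)
```

The point of the sketch is only that each flag change is committed immediately; whether the real command does this is exactly what is in question here.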
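One day I noticed the system was very slow, so I ran:
ps aux | grep manage.py
This showed hundreds of these processes, all apparently doing nothing.
Then I ran:
ps aux | grep -i postgres
and saw just as many lines like: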
postgres: dbuser dbname 127.0.0.1(47095) UPDATE waiting
Querying pg_stat_activity shows that each of them is running the query that sets is_running = true:
UPDATE "myapp_job" SET "is_running" = true WHERE "myapp_job"."id" = 32
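In effect this is a very slow memory leak that also uses up all of my database connections. As a stopgap, I have been killing the hung postgres processes whose updates never finish, but that does not fix the underlying problem.
Why is this simple query not finishing? It looks deadlocked, but I can't see why. Nothing locks this table apart from other update queries, and none of those should take more than a few milliseconds.
Also, why doesn't the query time out? I have run into deadlocks before, and Django/Postgres usually raises an explicit deadlock error that I can catch and use to retry the operation. Yet some of these have been waiting for more than 12 hours.
Could my use of pgbouncer be causing a transaction to stay open, preventing the locks on my table from being released?
Querying pg_locks explicitly shows several exclusive locks on my table, as well as several stalled processes: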
SELECT relation::regclass,
       locktype,
       pid,
       mode,
       granted
FROM pg_locks
WHERE relation::regclass::varchar LIKE 'myapp_job';
gives me:
+-----------+----------+-------+------------------+---------+
| relation  | locktype | pid   | mode             | granted |
+-----------+----------+-------+------------------+---------+
| myapp_job | relation |  1995 | AccessShareLock  | t       |
| myapp_job | relation |  1995 | RowExclusiveLock | t       |
| myapp_job | tuple    | 31497 | ExclusiveLock    | t       |
| myapp_job | relation |  5773 | AccessShareLock  | t       |
| myapp_job | relation |  1904 | AccessShareLock  | t       |
| myapp_job | relation |  1904 | RowExclusiveLock | t       |
| myapp_job | relation |  1858 | AccessShareLock  | t       |
| myapp_job | relation |  1858 | RowExclusiveLock | t       |
| myapp_job | relation | 32348 | RowShareLock     | t       |
| myapp_job | relation | 31497 | RowShareLock     | t       |
| myapp_job | tuple    |  1995 | ExclusiveLock    | f       |
| myapp_job | tuple    | 32348 | ExclusiveLock    | f       |
| myapp_job | tuple    |  1858 | ExclusiveLock    | f       |
| myapp_job | tuple    |  1904 | ExclusiveLock    | f       |
| myapp_job | tuple    |  1950 | ExclusiveLock    | f       |
| myapp_job | relation |  1950 | AccessShareLock  | t       |
| myapp_job | relation |  1950 | RowExclusiveLock | t       |
| myapp_job | relation |  5731 | AccessShareLock  | t       |
| myapp_job | relation |  5731 | RowShareLock     | t       |
+-----------+----------+-------+------------------+---------+
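To make the output above easier to read, here is a small sketch that splits the tuple-level locks into the pids that hold one and the pids still waiting for one. The rows are copied from the table above; the function name tuple_lock_summary is made up for this example, and it only summarises the output rather than explaining it.

```python
# (relation, locktype, pid, mode, granted) rows copied from the pg_locks output.
ROWS = [
    ("myapp_job", "relation", 1995, "AccessShareLock", True),
    ("myapp_job", "relation", 1995, "RowExclusiveLock", True),
    ("myapp_job", "tuple", 31497, "ExclusiveLock", True),
    ("myapp_job", "relation", 5773, "AccessShareLock", True),
    ("myapp_job", "relation", 1904, "AccessShareLock", True),
    ("myapp_job", "relation", 1904, "RowExclusiveLock", True),
    ("myapp_job", "relation", 1858, "AccessShareLock", True),
    ("myapp_job", "relation", 1858, "RowExclusiveLock", True),
    ("myapp_job", "relation", 32348, "RowShareLock", True),
    ("myapp_job", "relation", 31497, "RowShareLock", True),
    ("myapp_job", "tuple", 1995, "ExclusiveLock", False),
    ("myapp_job", "tuple", 32348, "ExclusiveLock", False),
    ("myapp_job", "tuple", 1858, "ExclusiveLock", False),
    ("myapp_job", "tuple", 1904, "ExclusiveLock", False),
    ("myapp_job", "tuple", 1950, "ExclusiveLock", False),
    ("myapp_job", "relation", 1950, "AccessShareLock", True),
    ("myapp_job", "relation", 1950, "RowExclusiveLock", True),
    ("myapp_job", "relation", 5731, "AccessShareLock", True),
    ("myapp_job", "relation", 5731, "RowShareLock", True),
]


def tuple_lock_summary(rows):
    """Return (holders, waiters): pids with a granted / ungranted tuple lock."""
    holders, waiters = set(), set()
    for _relation, locktype, pid, _mode, granted in rows:
        if locktype != "tuple":
            continue  # relation-level locks are all granted here
        (holders if granted else waiters).add(pid)
    return sorted(holders), sorted(waiters)


if __name__ == "__main__":
    holders, waiters = tuple_lock_summary(ROWS)
    print("holding tuple lock:", holders)
    print("waiting for tuple lock:", waiters)
```

On these rows, one pid (31497) holds the tuple lock and five others are queued behind it.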
Joining pg_locks to pg_stat_activity shows which queries are involved in the deadlock:
SELECT pl.relation::regclass,
       pl.locktype,
       pl.pid,
       pl.mode,
       pl.granted,
       pa.query_start,
       pa.current_query AS query___________________________________
FROM pg_locks AS pl
INNER JOIN pg_stat_activity AS pa ON pa.procpid = pl.pid
WHERE pl.relation::regclass::varchar LIKE 'myapp_job'
ORDER BY pa.query_start;
gives me:
+-----------+----------+------+------------------+---------+-------------------------------+----------------------------------------------------------------+
| relation  | locktype | pid  | mode             | granted | query_start                   | query                                                          |
+-----------+----------+------+------------------+---------+-------------------------------+----------------------------------------------------------------+
| myapp_job | relation | 5731 | AccessShareLock  | t       | 2014-02-28 00:00:01.936118-05 | <IDLE> in transaction                                          |
| myapp_job | relation | 5731 | RowShareLock     | t       | 2014-02-28 00:00:01.936118-05 | <IDLE> in transaction                                          |
| myapp_job | relation | 5773 | AccessShareLock  | t       | 2014-02-28 07:33:37.967912-05 | <IDLE> in transaction                                          |
| myapp_job | tuple    | 3867 | ExclusiveLock    | t       | 2014-02-28 10:46:47.363178-05 | UPDATE myapp_job SET is_running = true WHERE myapp_job.id = 32 |
| myapp_job | relation | 3867 | RowExclusiveLock | t       | 2014-02-28 10:46:47.363178-05 | UPDATE myapp_job SET is_running = true WHERE myapp_job.id = 32 |
| myapp_job | relation | 3867 | AccessShareLock  | t       | 2014-02-28 10:46:47.363178-05 | UPDATE myapp_job SET is_running = true WHERE myapp_job.id = 32 |
| myapp_job | tuple    | 3893 | ExclusiveLock    | f       | 2014-02-28 10:47:01.860486-05 | UPDATE myapp_job SET is_running = true WHERE myapp_job.id = 32 |
| myapp_job | relation | 3893 | AccessShareLock  | t       | 2014-02-28 10:47:01.860486-05 | UPDATE myapp_job SET is_running = true WHERE myapp_job.id = 32 |
| myapp_job | relation | 3893 | RowExclusiveLock | t       | 2014-02-28 10:47:01.860486-05 | UPDATE myapp_job SET is_running = true WHERE myapp_job.id = 32 |
| myapp_job | relation | 3932 | RowExclusiveLock | t       | 2014-02-28 10:48:02.124961-05 | UPDATE myapp_job SET is_running = true WHERE myapp_job.id = 32 |
| myapp_job | relation | 3932 | AccessShareLock  | t       | 2014-02-28 10:48:02.124961-05 | UPDATE myapp_job SET is_running = true WHERE myapp_job.id = 32 |
| myapp_job | tuple    | 3932 | ExclusiveLock    | f       | 2014-02-28 10:48:02.124961-05 | UPDATE myapp_job SET is_running = true WHERE myapp_job.id = 32 |
+-----------+----------+------+------------------+---------+-------------------------------+----------------------------------------------------------------+
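One thing that stands out in this output is the sessions shown as "<IDLE> in transaction" that still hold locks on the table. The sketch below just picks those out of the rows above. The function name idle_in_transaction is hypothetical, the rows are trimmed to (pid, mode, granted, query) with the UPDATE text shortened, and this output alone does not establish that these sessions are what blocks the UPDATE.

```python
UPDATE_SQL = "UPDATE myapp_job SET is_running = true ..."  # shortened for the example

# (pid, mode, granted, query) taken from the joined output above.
ROWS = [
    (5731, "AccessShareLock", True, "<IDLE> in transaction"),
    (5731, "RowShareLock", True, "<IDLE> in transaction"),
    (5773, "AccessShareLock", True, "<IDLE> in transaction"),
    (3867, "ExclusiveLock", True, UPDATE_SQL),
    (3867, "RowExclusiveLock", True, UPDATE_SQL),
    (3867, "AccessShareLock", True, UPDATE_SQL),
    (3893, "ExclusiveLock", False, UPDATE_SQL),
    (3893, "AccessShareLock", True, UPDATE_SQL),
    (3893, "RowExclusiveLock", True, UPDATE_SQL),
    (3932, "RowExclusiveLock", True, UPDATE_SQL),
    (3932, "AccessShareLock", True, UPDATE_SQL),
    (3932, "ExclusiveLock", False, UPDATE_SQL),
]


def idle_in_transaction(rows):
    """Map pid -> sorted lock modes held by sessions idle inside a transaction."""
    held = {}
    for pid, mode, granted, query in rows:
        if granted and query == "<IDLE> in transaction":
            held.setdefault(pid, set()).add(mode)
    return {pid: sorted(modes) for pid, modes in held.items()}


if __name__ == "__main__":
    for pid, modes in sorted(idle_in_transaction(ROWS).items()):
        print(pid, modes)
```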
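As the output shows, the earliest ExclusiveLock has been granted, yet that query has been running for more than ten minutes just to set is_running = true on a single row.
Why is this happening?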