No errors, infinite loading: where should I look?

Time: 2017-03-16 13:32:58

Tags: postgresql ubuntu logging ubuntu-16.04 postgresql-9.5

Here is the context; it happened just today:

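  • Ubuntu 16.04, PostgreSQL 9.5
  • Any page that queries PostgreSQL loads forever (no message)
  • Any query run with psql hangs forever (no message)
  • Nothing special in /var/log/postgresql/postgresql-9.5-main.log
  • Nothing special in /var/log/syslog
  • Server load is normal (about 1% per processor, memory barely used, disk space fine)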

Then, after running sudo service postgresql restart, everything worked fine again.

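Where should I look in a case like this? Is this a common problem? There are no logs, and I don't know what to do. How can I be relatively "sure" that it won't happen again at random?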

Side note: here is what the PostgreSQL log shows for the restart:

2017-03-16 12:20:52 CET [857-2] LOG:  received fast shutdown request
2017-03-16 12:20:52 CET [857-3] LOG:  aborting any active transactions
2017-03-16 12:20:52 CET [6829-1] XXXXX@YYYY FATAL:  terminating connection due to administrator command at character 15
2017-03-16 12:20:52 CET [6829-2] XXXXX@YYYY STATEMENT:  select * from "users" where "users"."id" = $1 and "users"."deleted_at" is null limit 1
2017-03-16 12:20:52 CET [6644-1] XXXXX@YYYY FATAL:  terminating connection due to administrator command at character 15
2017-03-16 12:20:52 CET [6644-2] XXXXX@YYYY STATEMENT:  select * from "users" where "users"."id" = $1 and "users"."deleted_at" is null limit 1
2017-03-16 12:20:52 CET [6649-1] XXXXX@YYYY FATAL:  terminating connection due to administrator command at character 15
2017-03-16 12:20:52 CET [6649-2] XXXXX@YYYY STATEMENT:  select * from "users" where "users"."id" = $1 and "users"."deleted_at" is null limit 1
2017-03-16 12:20:52 CET [6580-1] XXXXX@YYYY FATAL:  terminating connection due to administrator command at character 15
2017-03-16 12:20:52 CET [6580-2] XXXXX@YYYY STATEMENT:  select * from "users" where "users"."id" = $1 and "users"."deleted_at" is null limit 1
2017-03-16 12:20:52 CET [6768-1] XXXXX@YYYY FATAL:  terminating connection due to administrator command at character 15
2017-03-16 12:20:52 CET [6768-2] XXXXX@YYYY STATEMENT:  select * from "users" where "email" = $1 and "users"."deleted_at" is null limit 1
2017-03-16 12:20:52 CET [6253-1] XXXXX@YYYY FATAL:  terminating connection due to administrator command
2017-03-16 12:20:52 CET [6253-2] XXXXX@YYYY STATEMENT:  alter table "users" add column "email_confirmed" boolean not null default '1'
2017-03-16 12:20:52 CET [8465-1] XXXXX@YYYY FATAL:  terminating connection due to administrator command
2017-03-16 12:20:52 CET [913-2] LOG:  autovacuum launcher shutting down
2017-03-16 12:20:52 CET [6586-1] XXXXX@YYYY FATAL:  terminating connection due to administrator command at character 15
2017-03-16 12:20:52 CET [6586-2] XXXXX@YYYY STATEMENT:  select * from "users" where "users"."id" = $1 and "users"."deleted_at" is null limit 1
2017-03-16 12:20:52 CET [31969-1] XXXXX@YYYY FATAL:  terminating connection due to administrator command
2017-03-16 12:20:52 CET [6300-1] XXXXX@YYYY FATAL:  terminating connection due to administrator command
2017-03-16 12:20:52 CET [6974-1] XXXXX@YYYY FATAL:  the database system is shutting down
2017-03-16 12:20:52 CET [906-1] LOG:  shutting down
2017-03-16 12:20:52 CET [6980-1] XXXXX@YYYY FATAL:  the database system is shutting down
2017-03-16 12:20:52 CET [6978-1] XXXXX@YYYY FATAL:  the database system is shutting down
2017-03-16 12:20:52 CET [6975-1] XXXXX@YYYY FATAL:  the database system is shutting down
2017-03-16 12:20:52 CET [6977-1] XXXXX@YYYY FATAL:  the database system is shutting down
2017-03-16 12:20:52 CET [6976-1] XXXXX@YYYY FATAL:  the database system is shutting down
2017-03-16 12:20:52 CET [6983-1] XXXXX@YYYY FATAL:  the database system is shutting down
2017-03-16 12:20:52 CET [6979-1] XXXXX@YYYY FATAL:  the database system is shutting down
2017-03-16 12:20:52 CET [6982-1] XXXXX@YYYY FATAL:  the database system is shutting down
2017-03-16 12:20:52 CET [6981-1] XXXXX@YYYY FATAL:  the database system is shutting down
2017-03-16 12:20:52 CET [6985-1] XXXXX@YYYY FATAL:  the database system is shutting down
2017-03-16 12:20:52 CET [6984-1] XXXXX@YYYY FATAL:  the database system is shutting down
2017-03-16 12:20:52 CET [906-2] LOG:  database system is shut down
2017-03-16 12:20:53 CET [7002-1] LOG:  database system was shut down at 2017-03-16 12:20:52 CET 

1 answer:

Answer 0: (score: 3)

This is the statement that caused your problem:

alter table "users" add column "email_confirmed" boolean not null default '1'

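Such an ALTER TABLE requires an ACCESS EXCLUSIVE lock on the table, which conflicts with every other lock mode, even the one taken for plain read access.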

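Now users seems to be a busy table. Normally that is not a problem (unless two transactions try to write the same row), because the table locks those sessions take do not conflict with each other.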

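Now along comes the ALTER TABLE, and it has to wait until all earlier transactions on the table (each of which holds at least an ACCESS SHARE lock on it) have finished. Every query on the table that starts after the ALTER TABLE has to queue behind it, because the lock it needs conflicts with the lock the ALTER TABLE is waiting for.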

Once the ALTER TABLE has obtained the ACCESS EXCLUSIVE lock, it should finish quickly, and things return to normal.
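One possible safeguard, which is not part of the explanation above, is to cap how long the ALTER TABLE may wait for its lock so that a queue cannot build up behind it. The sketch below assumes lock_timeout is available (PostgreSQL 9.3 and later) and that the migration can simply be retried; the 5-second value is an arbitrary example.

```sql
BEGIN;
-- Applies only to this transaction: give up if the lock is not granted in time.
SET LOCAL lock_timeout = '5s';
-- If the lock cannot be obtained within 5 seconds, this statement fails with a
-- lock timeout error instead of blocking every later query on users.
ALTER TABLE users ADD COLUMN email_confirmed boolean NOT NULL DEFAULT '1';
-- After a timeout the transaction is aborted, so COMMIT acts as a rollback.
COMMIT;
```

If the statement times out, nothing has changed and it can be retried later, ideally after the long-running transaction has been dealt with.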

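One riddle remains: all the other interrupted queries in your log look like very short queries, so it is not obvious why you had to wait that long; they should finish within microseconds.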

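To explain that, remember that locks are not released when a query finishes; they are held until the end of the transaction. So if you run a short query inside a transaction and then leave the transaction open, that idle database session blocks the ALTER TABLE. Under normal circumstances you might not notice, particularly if all transactions are read-only, but in a case like this it proves harmful.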
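The sequence described above can be reproduced on a test database with three psql sessions. This is a minimal sketch, assuming a users table with an id column as in the logged queries; the row values are placeholders, and each commented section is meant to be typed into its own session in the order shown.

```sql
-- Session A: a short query inside a transaction that is never finished.
BEGIN;
SELECT * FROM users WHERE id = 1;   -- takes an ACCESS SHARE lock on users
-- No COMMIT or ROLLBACK: the session is now "idle in transaction"
-- and keeps holding its ACCESS SHARE lock.

-- Session B: needs ACCESS EXCLUSIVE, so it waits for session A.
ALTER TABLE users ADD COLUMN email_confirmed boolean NOT NULL DEFAULT '1';

-- Session C: started after B. Its ACCESS SHARE request conflicts with the
-- ACCESS EXCLUSIVE request B is waiting for, so it queues behind B and hangs.
SELECT * FROM users WHERE id = 2;

-- Session A: ending the transaction releases the lock; B runs, then C.
COMMIT;
```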

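By the way, leaving transactions open has other negative effects as well, notably that VACUUM cannot do its job properly, so your tables and indexes become bloated.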

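You can check whether your application has that problem by looking at the state column of the pg_stat_activity system view. If you regularly see sessions that are idle in transaction and stay that way for a while, you are looking at the root cause of the problem. This should be fixed!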
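A check along these lines could look like the query below. It is only a sketch: it uses columns that exist in pg_stat_activity on 9.5, and the column aliases are my own. For an idle session, query holds the last statement that session ran.

```sql
SELECT pid,
       usename,
       datname,
       state,
       now() - xact_start   AS transaction_age,  -- how long the transaction has been open
       now() - state_change AS idle_for,         -- how long it has been sitting idle
       query                AS last_query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;                             -- oldest open transaction first
```

Sessions that keep reappearing here with a large idle_for are the ones to trace back to the application code that opened the transaction.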

Instead of restarting the server, you could also have interrupted just the ALTER TABLE statement.
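One way to do that, sketched here with pg_cancel_backend, is to look up the backend running the statement and cancel it. The ILIKE filter is just an illustrative way to find it, and 6253 is the process ID that appears next to the ALTER TABLE in the log above; on a live system you would use whatever pid the first query returns. Cancelling another session's statement requires being a superuser or connecting as the same role.

```sql
-- Find the backend that is running the ALTER TABLE.
SELECT pid, state, now() - query_start AS running_for, query
FROM pg_stat_activity
WHERE query ILIKE 'alter table%';

-- Cancel only that statement; the session itself stays connected.
-- 6253 is an example pid taken from the log, not a value to reuse.
SELECT pg_cancel_backend(6253);
```

Once the waiting ALTER TABLE is gone, the queries queued behind it no longer conflict with anything and can proceed.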

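If you run into a problem like this again, query the pg_locks system view. You will see that the session running the ALTER TABLE has a lock with granted = FALSE, because it is waiting for the table lock. You can also see which other sessions hold locks on the table.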
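A query of roughly this shape shows both sides at once. It is a sketch that assumes you are connected to the database containing the table and that users resolves on your search_path; the join to pg_stat_activity is only there to show what each session is doing.

```sql
SELECT l.pid,
       l.mode,
       l.granted,                                 -- false for the waiting ALTER TABLE
       a.state,
       now() - a.xact_start AS transaction_age,
       a.query
FROM pg_locks AS l
JOIN pg_stat_activity AS a ON a.pid = l.pid
WHERE l.locktype = 'relation'
  AND l.relation = 'users'::regclass
  AND l.database = (SELECT oid FROM pg_database
                    WHERE datname = current_database())
ORDER BY l.granted DESC, a.xact_start;            -- lock holders first, oldest transaction first
```

Rows with granted = true and an old transaction_age, typically in state idle in transaction, are the sessions holding everything up.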