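I have Jenkins run a deploy script across all of our app machines. Recently, about half of my builds have failed to finish and hang while trying to run the same step. The last lines of output look like this: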
** [app@app1 :: stdout] Generating Configuration to /var/www/app/releases/20130509192657/config/production.sphinx.conf
** [app@app2 :: stdout] Generating Configuration to /var/www/app/releases/20130509192657/config/production.sphinx.conf
** [app@app3 :: stdout] Generating Configuration to /var/www/app/releases/20130509192657/config/production.sphinx.conf
** [app@app4 :: stdout] Generating Configuration to /var/www/app/releases/20130509192657/config/production.sphinx.conf
** [app@app6 :: stdout] Generating Configuration to /var/www/app/releases/20130509192657/config/production.sphinx.conf
** [app@app7 :: stdout] Generating Configuration to /var/www/app/releases/20130509192657/config/production.sphinx.conf
app5 is always the machine that seems to hit this problem, and it happens while trying to run:
/usr/local/bin/ruby /usr/local/bin/bundle exec rake db:migrate ts:conf
Production is running ruby 1.9.3p194 and, for legacy reasons, we're still running ThinkingSphinx v.9.8.8. We're also running Rails 3.2.13 and ThinkingSphinx 2.0.7.
Running strace on the hung process tells me:
...
29802 select(4, [3], NULL, NULL, NULL <unfinished ...>
29790 restart_syscall(<... resuming interrupted call ...>) = -1 ETIMEDOUT (Connection timed out)
29790 futex(0x64a88e8, FUTEX_WAKE_PRIVATE, 1) = 0
29790 write(4, "!", 1 <unfinished ...>
29802 <... select resumed> ) = 1 (in [3])
29790 <... write resumed> ) = 1
29802 read(3, <unfinished ...>
29790 futex(0x1d47f64, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
29802 <... read resumed> "!", 1024) = 1
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
...
Has anyone seen this before? I don't have much of a sysops background, so how should I go about troubleshooting this?
Answer 0: (score: 0)
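If db:migrate is what is locking up, then there may be an active or hung (perhaps zombie) process holding a lock on a database table (or other resource) referenced in the migration. I ran into this recently: a data-fix script run by another engineer had crashed (a week before I tried to deploy) but never exited, and it was holding an open transaction that blocked changes to the table. For us, the fix was simply to kill the stuck process, after which the migration ran normally.
Without knowing more about your system architecture, it's hard to say exactly which resource is in contention. Your RDBMS toolkit will probably let you inspect the databases hosted on the server and see which connections are open.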
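As a rough illustration of inspecting open connections, here is a minimal Ruby sketch. It assumes the database is MySQL (the question does not say), whose `SHOW FULL PROCESSLIST` statement returns one row per connection with columns such as `Id`, `Command`, `Time` and `Info`. The helper name `suspect_connections`, the 300-second threshold and the sample rows are all hypothetical, made up for the example.

```ruby
# Hypothetical helper: picks out connections that have sat in their current
# state for a long time. `rows` is an array of hashes shaped like the rows
# MySQL returns for SHOW FULL PROCESSLIST (an assumption; adapt the column
# names for other databases).
def suspect_connections(rows, min_seconds: 300)
  rows.select { |row| row["Time"].to_i >= min_seconds }
      .sort_by { |row| -row["Time"].to_i } # longest-lived first
end

# Made-up sample data standing in for real process list output.
sample_rows = [
  { "Id" => 101, "User" => "app", "Command" => "Sleep", "Time" => 604_800, "Info" => nil },
  { "Id" => 202, "User" => "app", "Command" => "Query", "Time" => 2, "Info" => "SHOW FULL PROCESSLIST" }
]

suspect_connections(sample_rows).each do |row|
  puts "connection #{row['Id']} (#{row['Command']}) has been in its current state for #{row['Time']}s"
end
```

In a Rails console the rows could probably be fetched with something like `ActiveRecord::Base.connection.select_all("SHOW FULL PROCESSLIST").to_a`, though the exact return type varies by Rails version and adapter. A connection that has been sleeping for days is a candidate for the kind of stuck process described above; in MySQL it can be ended with `KILL <Id>` once you are sure it is safe to do so.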