Heroku web worker timeouts caused by worker

Time: 2014-04-22 12:57:44

Tags: ruby heroku ruby-on-rails-3.2

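I am having a problem in an application hosted on Heroku (2x web + 1x worker dynos). One process ends with a large number of emails being created and sent through SendGrid. This takes some time, which caused web worker timeouts and poor usability, so it was refactored into a worker. I thought that would solve the problem, but I still get situations like this: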

Apr 10 17:12:48 wc heroku/web.2:  Processing by DealsController#show as */* 
[request is processed]
Apr 10 17:12:50 wc app/worker.1:  [worker sending emails] 
[a lot of lines with debug data cut]
Apr 10 17:12:53 wc heroku/worker.1:  source=worker.1 dyno=heroku.16079351.858a3455-0b9f-4f75-9052-b419d4653703 sample#load_avg_1m=0.29 sample#load_avg_5m=0.07 sample#load_avg_15m=0.02 
Apr 10 17:12:53 wc heroku/worker.1:  source=worker.1 dyno=heroku.16079351.858a3455-0b9f-4f75-9052-b419d4653703 sample#memory_total=240.45MB sample#memory_rss=240.34MB sample#memory_cache=0.11MB sample#memory_swap=0.00MB sample#memory_pgpgin=85990pages sample#memory_pgpgout=24436pages 
Apr 10 17:12:53 wc heroku/web.2:  source=web.2 dyno=heroku.16079351.879182ef-13f0-4908-bf35-c487ccab6153 sample#load_avg_1m=0.00 sample#load_avg_5m=0.01 sample#load_avg_15m=0.01 
Apr 10 17:12:54 wc heroku/web.2:  source=web.2 dyno=heroku.16079351.879182ef-13f0-4908-bf35-c487ccab6153 sample#memory_total=844.16MB sample#memory_rss=511.82MB sample#memory_cache=0.00MB sample#memory_swap=332.34MB sample#memory_pgpgin=223581pages sample#memory_pgpgout=92554pages 
Apr 10 17:12:54 wc heroku/web.2:  Process running mem=844M(164.9%) 
Apr 10 17:12:54 wc heroku/web.2:  Error R14 (Memory quota exceeded) 
Apr 10 17:12:54 wc app/web.2:  ** [NewRelic][04/10/14 15:12:54 +0000 879182ef-13f0-4908-bf35-c487ccab6153 (468)] INFO : Starting Agent shutdown 
Apr 10 17:12:55 wc app/worker.1:  ** [Bugsnag] Bugsnag exception handler 1.6.2 ready, api_key=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
[8 new relic lines cut]
Apr 10 17:12:56 wc app/heroku-postgres:  source=HEROKU_POSTGRESQL_WHITE sample#current_transaction=144542 sample#db_size=189720760bytes sample#tables=92 sample#active-connections=12 sample#waiting-connections=0 sample#index-cache-hit-rate=0.99981 sample#table-cache-hit-rate=0.99868 sample#load-avg-1m=0.36 sample#load-avg-5m=0.3 sample#load-avg-15m=0.285 sample#read-iops=38.367 sample#write-iops=13.221 sample#memory-total=7629464kB sample#memory-free=187884kB sample#memory-cached=6599816kB sample#memory-postgres=689216kB 
Apr 10 17:12:56 wc app/worker.1:  ** [NewRelic][04/10/14 15:12:55 +0000 858a3455-0b9f-4f75-9052-b419d4653703 (99)] INFO : Installing DelayedJob instrumentation hooks 
[11 new relic lines cut] 
Apr 10 17:12:59 wc app/worker.1:    Delayed::Backend::ActiveRecord::Job Load (1.6ms)  UPDATE "delayed_jobs" SET locked_at = '2014-04-10 15:12:59.482734', locked_by = 'host:858a3455-0b9f-4f75-9052-b419d4653703 pid:2' WHERE id IN (SELECT id FROM "delayed_jobs" WHERE ((run_at <= '2014-04-10 15:12:59.481969' AND (locked_at IS NULL OR locked_at < '2014-04-10 11:12:59.482002') OR locked_by = 'host:858a3455-0b9f-4f75-9052-b419d4653703 pid:2') AND failed_at IS NULL) ORDER BY priority ASC, run_at ASC LIMIT 1 FOR UPDATE) RETURNING * 
Apr 10 17:12:59 wc app/worker.1:  ** [NewRelic][04/10/14 15:12:59 +0000 858a3455-0b9f-4f75-9052-b419d4653703 (99)] INFO : Starting Agent shutdown 
Apr 10 17:13:09 wc app/worker.1:    Delayed::Backend::ActiveRecord::Job Load (1.6ms)  UPDATE "delayed_jobs" SET locked_at = '2014-04-10 15:13:09.486147', locked_by = 'host:858a3455-0b9f-4f75-9052-b419d4653703 pid:2' WHERE id IN (SELECT id FROM "delayed_jobs" WHERE ((run_at <= '2014-04-10 15:13:09.485453' AND (locked_at IS NULL OR locked_at < '2014-04-10 11:13:09.485486') OR locked_by = 'host:858a3455-0b9f-4f75-9052-b419d4653703 pid:2') AND failed_at IS NULL) ORDER BY priority ASC, run_at ASC LIMIT 1 FOR UPDATE) RETURNING * 
Apr 10 17:13:13 wc heroku/web.2:  source=web.2 dyno=heroku.16079351.879182ef-13f0-4908-bf35-c487ccab6153 sample#load_avg_1m=0.00 sample#load_avg_5m=0.01 sample#load_avg_15m=0.01 
Apr 10 17:13:14 wc heroku/web.2:  source=web.2 dyno=heroku.16079351.879182ef-13f0-4908-bf35-c487ccab6153 sample#memory_total=463.75MB sample#memory_rss=153.29MB sample#memory_cache=0.00MB sample#memory_swap=310.46MB sample#memory_pgpgin=245188pages sample#memory_pgpgout=205946pages 
Apr 10 17:13:14 wc app/web.2:  Started GET "/user" for [IP.IP.IP.IP] at 2014-04-10 15:13:13 +0000 
Apr 10 17:13:14 wc heroku/worker.1:  source=worker.1 dyno=heroku.16079351.858a3455-0b9f-4f75-9052-b419d4653703 sample#load_avg_1m=0.21 sample#load_avg_5m=0.06 sample#load_avg_15m=0.02 
Apr 10 17:13:14 wc heroku/worker.1:  source=worker.1 dyno=heroku.16079351.858a3455-0b9f-4f75-9052-b419d4653703 sample#memory_total=157.25MB sample#memory_rss=157.14MB sample#memory_cache=0.11MB sample#memory_swap=0.00MB sample#memory_pgpgin=103179pages sample#memory_pgpgout=62923pages 
Apr 10 17:13:16 wc app/web.2:  Started GET "/user" for [IP.IP.IP.IP] at 2014-04-10 15:13:16 +0000 
Apr 10 17:13:17 wc heroku/router:  at=error code=H12 desc="Request timeout" method=GET path=/pages/planning host=www.cool-app.com request_id=c62a7ee5-11d8-4286-846a-a55861cc6a0e fwd="[IP.IP.IP.IP]" dyno=web.2 connect=2ms service=30000ms status=503 bytes=0 
Apr 10 17:13:19 wc app/web.2:  E, [2014-04-10T15:13:18.948990 #2] ERROR -- : worker=1 PID:12 timeout (31s > 30s), killing 
Apr 10 17:13:19 wc app/worker.1:    Delayed::Backend::ActiveRecord::Job Load (1.5ms)  UPDATE "delayed_jobs" SET locked_at = '2014-04-10 15:13:19.489494', locked_by = 'host:858a3455-0b9f-4f75-9052-b419d4653703 pid:2' WHERE id IN (SELECT id FROM "delayed_jobs" WHERE ((run_at <= '2014-04-10 15:13:19.488845' AND (locked_at IS NULL OR locked_at < '2014-04-10 11:13:19.488874') OR locked_by = 'host:858a3455-0b9f-4f75-9052-b419d4653703 pid:2') AND failed_at IS NULL) ORDER BY priority ASC, run_at ASC LIMIT 1 FOR UPDATE) RETURNING * 
Apr 10 17:13:20 wc app/web.2:  E, [2014-04-10T15:13:19.689336 #2] ERROR -- : reaped #<Process::Status: pid 12 SIGKILL (signal 9)> worker=1 
Apr 10 17:13:21 wc app/web.2:  Disconnected from ActiveRecord 
Apr 10 17:13:27 wc app/web.2:  Processing by UsersController#show as HTML 
[web worker 2 works properly from now on]

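The web.2 worker still crashes, which causes users to hang for about 20 seconds and then see the "Application error" screen. Oddly, this happens on different pages and seems to be related to the worker running in the background.

One line that particularly confuses me (probably a symptom of the crash) is: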
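
The log also shows web.2 reporting `Error R14 (Memory quota exceeded)` about 20 seconds before the H12 timeout. A minimal sketch for pulling the memory samples out of such a line and relating them to the quota follows. The 512 MB quota is an assumption inferred from `mem=844M(164.9%)` in the log, the helper names are hypothetical, and the sample line is abridged to its memory fields.

```ruby
# Sketch: read the memory samples from a Heroku runtime-metrics log line
# and compare the total with the dyno's memory quota.
# ASSUMPTION: a 512 MB quota, inferred from "mem=844M(164.9%)" in the log.
QUOTA_MB = 512.0

# Returns a hash like {"memory_total" => 844.16, ...} containing every
# sample#memory_*=<number>MB pair in the line. Page counters such as
# memory_pgpgin=...pages are skipped because they do not end in "MB".
def memory_samples(line)
  pairs = line.scan(/sample#(memory_\w+)=([\d.]+)MB/)
  pairs.map { |name, mb| [name, mb.to_f] }.to_h
end

# Memory use as a percentage of the quota.
def quota_percent(total_mb, quota_mb = QUOTA_MB)
  total_mb / quota_mb * 100
end

# Abridged copy of the web.2 sample logged at 17:12:54.
line = 'source=web.2 sample#memory_total=844.16MB sample#memory_rss=511.82MB ' \
       'sample#memory_cache=0.00MB sample#memory_swap=332.34MB'

samples = memory_samples(line)
puts "total: #{samples['memory_total']} MB"
puts "swap:  #{samples['memory_swap']} MB"
puts "over quota: #{quota_percent(samples['memory_total']) > 100}"
```

At that moment web.2 was holding roughly 332 MB in swap. Heavy swapping can by itself slow requests enough to hit the 30-second limit, although the log alone does not prove that this was the cause here.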

Apr 10 17:13:19 wc app/web.2:  E, [2014-04-10T15:13:18.948990 #2] ERROR -- : worker=1 PID:12 timeout (31s > 30s), killing 

What does this mean? It looks to me as if the web.2 dyno was killed because worker=1 had a timeout, which seems a bit crazy.

The dyno configuration is:

Dynos:
1x - 2 - web bundle exec unicorn -p $PORT -c ./config/unicorn.rb
1x - 1 - worker bundle exec rake jobs:work
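
The `config/unicorn.rb` referenced in the web command is not shown. For reference, here is a minimal sketch of what such a file might contain. Every value is an assumption chosen to be consistent with the log: a 30-second `timeout`, and at least two worker processes, since Unicorn numbers its workers from 0 and the log mentions `worker=1`. The Unicorn master kills any of its own child workers that spends longer than `timeout` on a request, and logs a line of the form `worker=N PID:... timeout (...), killing`. Read that way, `worker=1` would be a Unicorn child process inside the web.2 dyno, not the Heroku worker.1 dyno.

```ruby
# config/unicorn.rb: hypothetical sketch, NOT the poster's actual file.
worker_processes 2   # assumed; "worker=1" in the log implies at least two
timeout 30           # assumed; matches "timeout (31s > 30s)" in the log
preload_app true     # assumed

before_fork do |_server, _worker|
  # Close the master's database connection before forking children.
  ActiveRecord::Base.connection.disconnect! if defined?(ActiveRecord::Base)
end

after_fork do |_server, _worker|
  # Each Unicorn worker opens its own database connection.
  ActiveRecord::Base.establish_connection if defined?(ActiveRecord::Base)
end
```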

Any ideas?

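1 answer:

Answer 0 (score: 1)

Solution: switch from delayed_job to sidekiq

For whom it may concern: despite several hours of debugging, I really could not get to the bottom of this. I decided to try moving this one job to sidekiq, and that solved the problem. It may simply be a bug in the specific delayed_job -> Heroku interaction. Maybe this will help someone some day...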