Meteor error: write after end

Time: 2017-08-17 16:22:22

Tags: mongodb nginx meteor

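EDIT: It appears that the second server does get this error occasionally, which makes me almost certain it is a configuration problem. It could be one of the following: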

net.ipv4.tcp_fin_timeout = 2
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse =1

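Version info, as requested: Meteor: 1.5.0, OS: Ubuntu 16.04, Provider: AWS EC2

I am getting the following error intermittently, and seemingly at random, on both of the processes running on one server (of a pair). The other server never gets this error, and the error does not reference any code I have written, so I can only assume it is either (a) a bug in Meteor or (b) a misconfiguration of my server. The server where the processes crash also hosts two other Meteor sites, both of which occasionally get this error: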

Error: write after end
at writeAfterEnd (_stream_writable.js:167:12)
at PassThrough.Writable.write (_stream_writable.js:212:5)
at IncomingMessage.ondata (_stream_readable.js:542:20)
at emitOne (events.js:77:13)
at IncomingMessage.emit (events.js:169:7)
at IncomingMessage.Readable.read (_stream_readable.js:368:10)
at flow (_stream_readable.js:759:26)
at resume_ (_stream_readable.js:739:3)
at nextTickCallbackWith2Args (node.js:511:9)
at process._tickDomainCallback (node.js:466:17)
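
For context, this is the error Node.js reports when something writes to a writable stream after `end()` has already been called on it. In the trace above, data from an `IncomingMessage` is being written into a `PassThrough` that had already ended. A minimal sketch that reproduces the same message outside Meteor, assuming only Node's built-in `stream` module (the function name `writeAfterEndDemo` is made up for illustration):

```javascript
const PassThrough = require('stream').PassThrough;

// Hypothetical helper: ends a PassThrough, then writes to it anyway.
// Resolves with the error object that the stream reports.
function writeAfterEndDemo() {
  return new Promise((resolve) => {
    const stream = new PassThrough();
    // With an 'error' listener attached, the error is delivered here.
    // Without one, Node rethrows it as an uncaught exception, which is
    // what takes the whole process down.
    stream.on('error', (err) => resolve(err));
    stream.end();                // the stream is now closed for writing
    stream.write('late chunk');  // this write arrives after end()
  });
}

writeAfterEndDemo().then((err) => {
  console.log(err.message);
});
```

The sketch only illustrates the mechanism. It does not identify which component in the Meteor, NGINX, or Kadira stack ended the stream early.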

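Things I have checked:

  1. Memory limits (nowhere near)
  2. Connection limits - very low, roughly 20 per server at the time of the failure, and they were bumped to the second server within 1 minute, which handled them plus its own just fine
  3. Process limits - both processes on server 1 failed within 7 minutes of each other.
  4. Server configuration - while trying to squeeze out some extra performance during load testing, I modified sysctl.conf based on posts I had seen about high-load node.js servers. These are the contents of sysctl.conf on the failing server; however, the server that runs correctly has the same configuration.

    fs.file-max = 1000000
    fs.nr_open = 1000000
    ifs.file-max = 70000
    net.nf_conntrack_max = 1048576
    net.ipv4.netfilter.ip_conntrack_max = 32768
    net.ipv4.tcp_fin_timeout = 2
    net.ipv4.tcp_max_orphans = 8192
    net.ipv4.ip_local_port_range = 16768    61000
    net.ipv4.tcp_max_syn_backlog = 10024
    net.ipv4.tcp_max_tw_buckets = 360000
    net.core.netdev_max_backlog = 2500
    net.ipv4.ip_local_port_range = 1024 65535
    net.ipv4.tcp_tw_recycle = 1
    net.ipv4.tcp_tw_reuse =1
    net.core.somaxconn = 20048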
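
One detail in this listing: `net.ipv4.ip_local_port_range` is assigned twice with different values. Settings are applied in file order, so the later assignment is the one that ends up in effect, which makes duplicates easy to misread. A small sketch for spotting them, assuming the file contents have already been read into a string (the function name `findDuplicateSysctlKeys` is made up for illustration):

```javascript
// Hypothetical helper: returns the keys that are assigned more than once
// in sysctl.conf-style text ("key = value", with # or ; comment lines).
function findDuplicateSysctlKeys(text) {
  const counts = Object.create(null);
  text.split('\n').forEach((rawLine) => {
    const line = rawLine.trim();
    if (line === '' || line[0] === '#' || line[0] === ';') return;
    const eq = line.indexOf('=');
    if (eq === -1) return; // not an assignment line
    const key = line.slice(0, eq).trim();
    counts[key] = (counts[key] || 0) + 1;
  });
  return Object.keys(counts).filter((key) => counts[key] > 1);
}

// Three lines taken from the configuration listed above.
const sample = [
  'net.ipv4.ip_local_port_range = 16768    61000',
  'net.ipv4.tcp_max_syn_backlog = 10024',
  'net.ipv4.ip_local_port_range = 1024 65535'
].join('\n');

console.log(findDuplicateSysctlKeys(sample));
```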
    

I have an NGINX balancer on server1 that load-balances across the 4 processes (2 per server). The NGINX error log contains lines like this:

    2017/08/17 16:15:01 [warn] 1221#1221: *6233472 an upstream response is buffered to a temporary file /var/lib/nginx/proxy/1/46/0000029461 while reading upstream, client: 164.68.80.47, server: server redacted, request: "GET path redacted HTTP/1.1", upstream: "path redacted", host: "host redacted", referrer: "referrer redacted"

At the time of the errors, I see pairs of lines like this:

    2017/08/17 15:07:19 [error] 1222#1222: *6215301 connect() failed (111: Connection refused) while connecting to upstream, client: ip redacted, server: server redacted, request: "GET /admin/sockjs/info?cb=o2ziavvsua HTTP/1.1", upstream: "http://127.0.0.1:8080/admin/sockjs/info?cb=o2ziavvsua", host: "hostname redacted", referrer: "referrer redacted"

    2017/08/17 15:07:19 [warn] 1222#1222: *6215301 upstream server temporarily disabled while connecting to upstream, client: ip redacted, server: server redacted, request: "GET /admin/sockjs/info?cb=o2ziavvsua HTTP/1.1", upstream: "http://127.0.0.1:8080/admin/sockjs/info?cb=o2ziavvsua", host: "hostname redacted", referrer: "referrer redacted"

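In case it matters, I use a 3-node mongo replica set, with both servers pointing at all 3 nodes.

I also use a custom hosted version of kadira (since it went offline).

If the error cannot be prevented, is there any way to stop it from taking down the whole process? At times there are 50-100 users connected per process, and kicking all of them off because of a single error seems excessive.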

1 answer:

Answer 0 (score: 0)

It has been two days without a crash, so I think the solution was changing:

net.ipv4.tcp_fin_timeout = 2
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1

to:

net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_tw_recycle = 0
net.ipv4.tcp_tw_reuse = 0

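I don't know which one was causing the problem (probably the timeout). I still think it is a "bug" that a single "write after end" error can crash the entire Meteor process. Perhaps it should just be logged.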
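
On the idea of only logging the error: one possible stopgap, sketched here under assumptions, is a process-level `uncaughtException` handler that logs this specific error and still exits on anything else. Node's documentation warns that continuing after an uncaught exception can leave the application in an undefined state, so this is a workaround and not a fix. The names `isWriteAfterEnd` and `installWriteAfterEndGuard` are hypothetical, and how such a handler interacts with Meteor's own error handling is an open assumption.

```javascript
// Hypothetical predicate: recognises the error from the stack trace.
// Newer Node versions also set err.code; older ones only set the message.
function isWriteAfterEnd(err) {
  return Boolean(err) &&
    (err.code === 'ERR_STREAM_WRITE_AFTER_END' ||
     err.message === 'write after end');
}

// Hypothetical installer: log "write after end", exit on anything else.
function installWriteAfterEndGuard(log) {
  log = log || console.error;
  process.on('uncaughtException', (err) => {
    if (isWriteAfterEnd(err)) {
      log('Ignored stream error:', err.stack || err);
      return; // keep the process and its connected users alive
    }
    log('Fatal uncaught exception:', err && err.stack ? err.stack : err);
    process.exit(1); // keep the usual crash behaviour for other errors
  });
}
```

Calling `installWriteAfterEndGuard()` once at server startup would register the handler. The sketch deliberately does not call it, so loading the code has no side effects.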