Question

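We are currently troubleshooting an issue with a specific Kubernetes Service (LoadBalancer) for an ElasticSearch cluster, which seems to return ECONNRESET to clients quite randomly. The setup is as follows:

Kubernetes 1.6 on AWS, deployed with Kops
Elasticsearch 2.4 cluster (deployed with the Fabric8 discovery plugin). It exposes itself as a Service / LoadBalancer
A NodeJS API deployment acting as the client, which connects to the Elasticsearch service above using the request module

After running for a while (not long, about 10-15 minutes), and not necessarily under much load (roughly 1 request per second), the ElasticSearch service seems to become unavailable. The symptom is ECONNRESET errors on the API side: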

{ Error: connect ECONNRESET 100.69.12.100:9200 at 
 Object.exports._errnoException (util.js:953:11) at 
 exports._exceptionWithHostPort (util.js:976:20) at 
 TCPConnectWrap.afterConnect [as oncomplete] (net.js:1080:14) cause: { 
 Error: connect ECONNRESET 100.69.12.100:9200 at 
 Object.exports._errnoException (util.js:953:11) at 
 exports._exceptionWithHostPort (util.js:976:20) at 
 TCPConnectWrap.afterConnect [as oncomplete] (net.js:1080:14) code: 
 'ECONNRESET', errno: 'ECONNRESET', syscall: 'connect', address: 
 '100.69.12.100', port: 9200 }, isOperational: true, code: 
 'ECONNRESET', errno: 'ECONNRESET', syscall: 'connect', address: 
 '100.69.12.100', port: 9200 }
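The `syscall: 'connect'` field in the trace above suggests the reset happened while the TCP connection was being established, rather than in the middle of a response. Below is a minimal sketch of a helper for logging that distinction. `describeSocketError` is a hypothetical name, and it relies only on the `code`, `syscall`, `address` and `port` fields that appear in the trace.

```javascript
// Hypothetical helper: summarises a Node.js socket error so that resets
// during connect can be told apart from resets during read/write.
function describeSocketError(err) {
  const phase = err.syscall === 'connect'
    ? 'while connecting'
    : `during ${err.syscall || 'unknown syscall'}`;
  return `${err.code} ${phase} to ${err.address}:${err.port}`;
}

// Example error object built from the fields shown in the trace above.
const sample = Object.assign(new Error('connect ECONNRESET 100.69.12.100:9200'), {
  code: 'ECONNRESET',
  errno: 'ECONNRESET',
  syscall: 'connect',
  address: '100.69.12.100',
  port: 9200,
});

console.log(describeSocketError(sample));
```

Logging the phase this way may help separate connection-setup failures (which point towards the Service / proxy layer) from resets on established connections.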

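This error occurs quite randomly; it does not require high load, nor does it follow a defined time period. At the exact times of those ECONNRESET errors, kube-dns logs the creation (re-creation?) of the ElasticSearch Service (and of other services):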

I0515 08:11:49.941166       1 dns.go:264] New service: elasticsearch
I0515 08:11:49.941226       1 dns.go:462] Added SRV record &{Host:elasticsearch.prod.svc.cluster.local. Port:9200 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:}
I0515 08:11:49.941275       1 dns.go:462] Added SRV record &{Host:elasticsearch.prod.svc.cluster.local. Port:9300 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:}

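These lines are repeated for every Kubernetes service roughly every 5 minutes. I am completely at a loss as to whether this is normal behaviour or whether it is related to the failures we are observing.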
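To check whether the kube-dns entries really line up with the resets, the "New service" events can be extracted from the log and compared against the API's error timestamps. The following is only a sketch: `newServiceEvents` is a hypothetical helper, and the regular expression assumes that every line follows the layout of the excerpt above.

```javascript
// Matches glog-style lines such as:
//   I0515 08:11:49.941166       1 dns.go:264] New service: elasticsearch
// Captures: month, day, HH:MM:SS, message.
const GLOG_LINE = /^[IWEF](\d{2})(\d{2}) (\d{2}:\d{2}:\d{2})\.\d+\s+\d+ \S+\] (.*)$/;

// Returns one entry per "New service: <name>" line found in the log text.
function newServiceEvents(logText) {
  const events = [];
  for (const line of logText.split('\n')) {
    const m = GLOG_LINE.exec(line.trim());
    if (!m) continue; // not a glog line
    const svc = /^New service: (\S+)$/.exec(m[4]);
    if (svc) {
      events.push({ month: m[1], day: m[2], time: m[3], service: svc[1] });
    }
  }
  return events;
}

// The first two lines of the kube-dns excerpt above, copied verbatim.
const sampleLog = `
I0515 08:11:49.941166       1 dns.go:264] New service: elasticsearch
I0515 08:11:49.941226       1 dns.go:462] Added SRV record &{Host:elasticsearch.prod.svc.cluster.local. Port:9200 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:}
`;

console.log(newServiceEvents(sampleLog));
```

Run over a longer kube-dns log, the resulting list would show whether the roughly 5-minute cadence matches the times at which the API logs ECONNRESET.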

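The interaction between the API and ElasticSearch over the network is basically a set of search queries (up to 30/50 in parallel). No error logs were observed on the ElasticSearch side during the interaction.

This setup worked fine for months. The changes since then were:

Kubernetes upgraded from version 1.4 to 1.6 using Kops
Added CPU/memory limits to the ElasticSearch cluster, following Elastic's recommendations
Added an extra 30 initial calls to Elastic, which are cached after the first call

Actions tried:

Decreased and increased the resource limits (CPU/memory) on the ElasticSearch side to see whether the behaviour changed. No effect on the ECONNRESET errors
Increased the thread pool and queue size for search on the ElasticSearch side. It handles more load, but the problem shows up even under low load.

We are considering rolling back to Kubernetes 1.4 and removing all limits. Any pointers or information would be very welcome.

Update 1: Some additional information about the elasticsearch service: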

$ kubectl get svc elasticsearch
NAME             CLUSTER-IP       EXTERNAL-IP    PORT(S)            AGE
elasticsearch   100.65.113.208    [AWS ELB] 9200:30295/TCP,9300:32535/TCP   6d

$ kubectl describe svc elasticsearch
  Name:         elasticsearch
  Namespace:        ppe
  Labels:           <none>
  Annotations:      kubernetes.io/change-cause=kubectl create --filename=app/elasticsearch/manifest/ --recursive=true --record=true --namespace=ppe --context=
  Selector:     component=elasticsearch,provider=fabric8,type=client
  Type:         LoadBalancer
  IP:           100.65.113.208
  LoadBalancer Ingress: [AWS ELB]
  Port:         http    9200/TCP
  NodePort:     http    30295/TCP
  Endpoints:        100.96.1.25:9200,100.96.1.26:9200,100.96.2.24:9200 + 5 more...
  Port:         transport   9300/TCP
  NodePort:     transport   32535/TCP
  Endpoints:        100.96.1.25:9300,100.96.1.26:9300,100.96.2.24:9300 + 5 more...
  Session Affinity: None
  Events:           <none>

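Update 2

I am now able to mitigate this issue by using the official ElasticSearch NodeJS client to run the search queries and communicate with the service. Until now, the code in the API used the request module to call the ElasticSearch REST API directly.

I am still investigating this, because the underlying problem is still there, but it does not seem to manifest when using the NodeJS client: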

const elasticSearch = require('elasticsearch');
const elasticSearchClient = new elasticSearch.Client({host: config.endpoints.elasticsearch, apiVersion: '2.3',
  maxRetries: 5, requestTimeout: 15000, deadTimeout: 30000, keepAlive: true});
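The client options above enable keep-alive and up to five retries, which may be why the resets stop surfacing. That explanation is an assumption, not something confirmed here. For code that has to keep calling the REST API directly, the same idea can be sketched with Node's built-in `http` module. `withRetry` and `searchOnce` are hypothetical names, and the retry count and delay are illustrative values rather than settings from the original deployment.

```javascript
const http = require('http');

// Reuse TCP connections instead of opening a new one per request.
const keepAliveAgent = new http.Agent({ keepAlive: true });

// Retries an async operation when it fails with ECONNRESET.
// Any other error, or running out of retries, rethrows the error.
async function withRetry(operation, maxRetries = 5, delayMs = 100) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      if (err.code !== 'ECONNRESET' || attempt >= maxRetries) throw err;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// Hypothetical search call: POSTs a query body to <host>:<port>/_search
// over the shared keep-alive agent and resolves with status and raw body.
function searchOnce(host, port, query) {
  return new Promise((resolve, reject) => {
    const body = JSON.stringify(query);
    const req = http.request({
      host,
      port,
      path: '/_search',
      method: 'POST',
      agent: keepAliveAgent,
      headers: {
        'Content-Type': 'application/json',
        'Content-Length': Buffer.byteLength(body),
      },
    }, (res) => {
      let data = '';
      res.setEncoding('utf8');
      res.on('data', (chunk) => { data += chunk; });
      res.on('end', () => resolve({ statusCode: res.statusCode, body: data }));
    });
    req.on('error', reject);
    req.end(body);
  });
}

// Usage (not executed here; host and port are placeholders):
//   withRetry(() => searchOnce('elasticsearch', 9200, { query: { match_all: {} } }))
//     .then((res) => console.log(res.statusCode));
```

Note that retrying only hides the resets. It does not explain why the Service drops connections in the first place.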

Update 3

We have observed this behaviour not only with this NodeJS API communicating with Elastic, but also with the kibana-logging Kubernetes service and with elasticdump:

Mon, 15 May 2017 13:32:31 GMT | sent 1000 objects to destination 
elasticsearch, wrote 1000
Mon, 15 May 2017 13:32:42 GMT | got 1000 objects from source 
elasticsearch (offset: 24000)
Mon, 15 May 2017 13:32:44 GMT | Error Emitted => failed to parse json 
(message: "Unexpected token E in JSON at position 0") - source: "Error: 
'read tcp 172.20.33.123:46758->100.96.4.15:9200: read: connection reset 
by peer'\nTrying to reach: 'http://100.96.4.15:9200/ghs.products-2017-
05-13/_bulk'"

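Both of these modules and the original NodeJS API use the request NPM module to communicate with ElasticSearch. Interestingly, the ElasticSearch NodeJS client (https://github.com/elastic/elasticsearch-js) does not appear to use request.

Still investigating, but this may end up being an issue/incompatibility between the way the Elastic Kubernetes Service is exposed and the request NodeJS module.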

ElasticSearch Service on Kubernetes 1.6

0 answers: