Question

I have a sharded cluster with 5 shards running on 5 servers. Each shard is a replica set with 1 primary and 2 secondaries. It runs MongoDB 3.4.5 on Linux RHEL 7.2.

The hostnames of the 5 servers are dscn022 - dscn026.

The cluster ran fine for a week, and then the following appeared in the config server's log:

> Tue Feb  6 15:37:28.904 I NETWORK  [thread2] Listener: accept() returns -1 Too many open files

> Tue Feb  6 15:37:28.905 E NETWORK  [thread2] Out of file descriptors. Waiting one second before trying to accept more connections.

I then checked ulimit. Can MongoDB really use that many file descriptors?

> [root@dscn022 ~]# ulimit -a

> open files                      (-n) 655360
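
Note that `ulimit -a` in a root shell reports the limits of that shell, which are not necessarily the limits applied to the running `mongod` or `mongos` process. A minimal sketch for checking the process itself, assuming a Linux `/proc` filesystem; the PID is a placeholder to be taken from `pgrep mongod` or similar, and the function names are hypothetical:

```python
import os
from pathlib import Path


def parse_max_open_files(limits_text):
    """Return the (soft, hard) 'Max open files' values from /proc/<pid>/limits text."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Line layout: "Max open files  <soft>  <hard>  files"
            fields = line.split()
            return fields[3], fields[4]
    raise ValueError("no 'Max open files' line found")


def open_fd_count(pid):
    """Count the file descriptors the process currently holds open."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def report(pid):
    """Print the effective limit and current usage for one process."""
    soft, hard = parse_max_open_files(Path(f"/proc/{pid}/limits").read_text())
    print(f"pid {pid}: soft={soft} hard={hard} in use={open_fd_count(pid)}")
```

Calling `report(<pid>)` as root for each `mongod` and `mongos` process shows whether the limit in effect is really 655360 and how close the process is to it.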

After that, the file descriptor error repeated about 40 times, and then other errors were logged.

> Tue Feb  6 15:38:10.360 I ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Connecting to dscn023:27018

> Tue Feb  6 15:38:10.360 I ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Failed to connect to dscn023:27018 - HostUnreachable: HostUnreachable

> Tue Feb  6 15:38:10.360 I ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Dropping all pooled connections to dscn023:27018 due to failed operation on a connection

> Tue Feb  6 15:38:10.360 I NETWORK  [replSetDistLockPinger] Marking host dscn023:27018 as failed :: caused by :: HostUnreachable: HostUnreachable

> Tue Feb  6 15:38:10.360 W SHARDING [replSetDistLockPinger] pinging failed for distributed lock pinger :: caused by :: HostUnreachable: HostUnreachable

> Tue Feb  6 15:38:10.914 I NETWORK  [thread2] connection accepted from 10.11.2.24:34192 #1680 (116 connections now open)

> Tue Feb  6 15:38:10.915 I NETWORK  [conn1680] received client metadata from 10.11.2.24:34192 conn1680: { driver: { name: "mongo-java-driver", version: "unknown" }, os: { type: "Linux", name: "Linux", architecture: "amd64", version: "3.10.0-327.el7.x86_64" }, platform: "Java/Oracle Corporation/1.8.0_73-b02" }

> Tue Feb  6 15:38:10.920 I NETWORK  [conn1680] Successfully connected to TOD-Shard-1/dscn022:27019,dscn023:27019,dscn026:27019 (65 connections now open to TOD-Shard-1/dscn022:27019,dscn023:27019,dscn026:27019 with a 0 second timeout)

> Tue Feb  6 15:38:10.920 I NETWORK  [conn1680] getaddrinfo("dscn022") failed: No address associated with hostname

> Tue Feb  6 15:38:10.920 I NETWORK  [conn1680] Marking host dscn022:27019 as failed :: caused by :: Location40333: can't connect to new replica set master [dscn022:27019], err: couldn't initialize connection to host dscn022, address is invalid

> Tue Feb  6 15:38:10.921 I NETWORK  [conn1680] getaddrinfo("dscn023") failed: No address associated with hostname

> Tue Feb  6 15:38:10.921 I NETWORK  [conn1680] Marking host dscn023:27019 as failed :: caused by :: Location11002: can't callLazy replica set node dscn023:27019: socket exception [CONNECT_ERROR] for dscn023:27019

> Tue Feb  6 15:38:10.921 I NETWORK  [conn1680] getaddrinfo("dscn026") failed: No address associated with hostname

> Tue Feb  6 15:38:10.921 I NETWORK  [conn1680] Marking host dscn026:27019 as failed :: caused by :: Location11002: can't callLazy replica set node dscn026:27019: socket exception [CONNECT_ERROR] for dscn026:27019

> Tue Feb  6 15:38:10.921 I NETWORK  [conn1680] getaddrinfo("dscn022") failed: No address associated with hostname

> Tue Feb  6 15:38:10.921 I NETWORK  [conn1680] Marking host dscn022:27019 as failed :: caused by :: Location40333: can't connect to new replica set master [dscn022:27019], err: couldn't initialize connection to host dscn022, address is invalid

> Tue Feb  6 15:38:10.921 W NETWORK  [conn1680] db exception when initializing on TOD-Shard-1, current connection state is { state: { conn: "TOD-Shard-1/dscn022:27019,dscn023:27019,dscn026:27019", vinfo: "expressbox.box @ 5|2326||5a7423d5a8007256eb0d6914", cursor: "(empty)", count: 0, done: false }, retryNext: false, init: false, finish: false, errored: false } :: caused by :: 16380 Failed to call say, no good nodes in TOD-Shard-1, last error: can't callLazy replica set node dscn022:27019: can't connect to new replica set master [dscn022:27019], err: couldn't initialize connection to host dscn022, address is invalid

> Tue Feb  6 15:38:10.921 I NETWORK  [conn1680] Ending connection to host TOD-Shard-1/dscn022:27019,dscn023:27019,dscn026:27019(with timeout of 0 seconds) due to bad connection status; 64 connections to that host remain open

> Tue Feb  6 15:38:15.103 I ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Connecting to dscn024:27018

> Tue Feb  6 15:38:15.105 I ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Failed to connect to dscn024:27018 - HostUnreachable: HostUnreachable

> Tue Feb  6 15:38:15.105 I ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Dropping all pooled connections to dscn024:27018 due to failed operation on a connection

> Tue Feb  6 15:38:31.997 I ACCESS   [UserCacheInvalidator] User cache generation changed from 5a7418c44db54c97a8335650 to 5a7418c67d14edfa21ed4f3d; invalidating user cache

The config server seems unable to resolve any of the hostnames, not even its own.
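
One way to tell a real DNS or `/etc/hosts` problem apart from a side effect of the descriptor exhaustion is to resolve the same names from a fresh process on the affected server. The sketch below assumes the five hostnames implied by the range dscn022 - dscn026 and uses port 27019 from the log; the `resolve` helper is hypothetical:

```python
import socket

# Hostnames assumed from the range given in the question.
HOSTS = ["dscn022", "dscn023", "dscn024", "dscn025", "dscn026"]


def resolve(host, port=27019):
    """Return (sorted addresses, None) on success or (None, error text) on failure."""
    try:
        infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return None, str(exc)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos}), None


def check_all(hosts=HOSTS):
    """Print the resolution result for every host."""
    for host in hosts:
        addresses, error = resolve(host)
        print(f"{host}: {addresses if error is None else 'FAILED: ' + error}")
```

If `check_all()` succeeds in a new process while `mongos` or the config server still logs `getaddrinfo(...) failed`, the lookup failures are more likely tied to the state of that process (for example, no free file descriptors) than to name resolution itself. That interpretation is an assumption, not something the logs confirm.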

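After restarting the routers and the config servers, the cluster appears healthy and everything works, except that my application can no longer query through the Java driver with read preference Secondary or SecondaryPreferred.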

Sharded collections can still be queried with Primary or PrimaryPreferred, and non-sharded collections can be queried with any read preference.

The same thing happened two weeks ago. I rebuilt the cluster, and now it has happened again.

Why does this happen, and what should I do to fix it?

Querying MongoDB secondary nodes hangs in the Java client after restarting the mongo router

0 answers: