ZooKeeper does not start after a restart

Time: 2014-05-14 15:51:28

Tags: solr apache-zookeeper

I have 3 ZooKeeper nodes. They were working fine, but when I restarted them with ./zkServer.sh restart, ZooKeeper did not come back up.

When I check the ZooKeeper status, it returns:

./zkServer.sh status
JMX enabled by default
Using config: /opt/zookeeper/bin/../conf/zoo.cfg
Error contacting service. It is probably not running.

My zoo.cfg is:

dataDir=/var/lib/zookeeperdata/3
clientPort=2181
initLimit=50
tickTime=2000
syncLimit=10
maxClientCnxns=100000
server.1=IP1 value:2888:3888
server.2=IP2 value:2889:3889
server.3=127.0.0.1:2890:3890

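The behavior is intermittent: if I restart the 3 ZooKeeper nodes again two hours from now, or tomorrow, they may see each other and work fine. That has happened to me before.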

ZooKeeper log:

2014-05-14 15:22:34,236 [myid:3] - INFO  [main:NIOServerCnxnFactory@94] - binding to port 0.0.0.0/0.0.0.0:2181
2014-05-14 15:22:34,282 [myid:3] - INFO  [main:QuorumPeer@913] - tickTime set to 2000
2014-05-14 15:22:34,283 [myid:3] - INFO  [main:QuorumPeer@933] - minSessionTimeout set to -1
2014-05-14 15:22:34,283 [myid:3] - INFO  [main:QuorumPeer@944] - maxSessionTimeout set to -1
2014-05-14 15:22:34,283 [myid:3] - INFO  [main:QuorumPeer@959] - initLimit set to 50
2014-05-14 15:22:34,356 [myid:3] - INFO  [main:FileSnap@83] - Reading snapshot /var/lib/zookeeperdata/3/version-2/snapshot.f100000001
2014-05-14 15:22:43,387 [myid:3] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /127.0.0.1:50923
2014-05-14 15:22:43,396 [myid:3] - INFO  [Thread-1:QuorumCnxManager$Listener@486] - My election bind port: 0.0.0.0/0.0.0.0:3890
2014-05-14 15:22:43,404 [myid:3] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
2014-05-14 15:22:43,404 [myid:3] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for client /127.0.0.1:50923 (no session established for client)
2014-05-14 15:22:43,427 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:QuorumPeer@670] - LOOKING
2014-05-14 15:22:43,429 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@740] - New election. My id =  3, proposed zxid=0xf100000001
2014-05-14 15:22:48,438 [myid:3] - WARN  [WorkerSender[myid=3]:QuorumCnxManager@368] - Cannot open channel to 1 at election address /54.76.10.81:3888
java.net.SocketTimeoutException: connect timed out
  at java.net.PlainSocketImpl.socketConnect(Native Method)
  at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
  at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
  at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
  at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
  at java.net.Socket.connect(Socket.java:529)
  at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:354)
  at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:327)
  at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:393)
  at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:365)
  at java.lang.Thread.run(Thread.java:662)
2014-05-14 15:22:53,440 [myid:3] - WARN  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@368] - Cannot open channel to 1 at election address /54.76.10.81:3888
java.net.SocketTimeoutException: connect timed out
  at java.net.PlainSocketImpl.socketConnect(Native Method)
  at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
  at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
  at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
  at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
  at java.net.Socket.connect(Socket.java:529)
  at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:354)
  at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:388)

I have searched a lot about this but have not found anything that works for me, so I hope someone can help.

Thanks

3 answers:

Answer 0 (score: 1)

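I have seen this kind of behavior too. A ZK configuration that has been running fine will sometimes fail to restart. When that happens, I try the following: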

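1) Check the logs on all servers... errors are usually listed there.
2) Stop all servers and restart them.
3) Stop all servers and restart them one at a time.
4) Verify that each server's myid file exists, has the correct permissions, and contains the correct value.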
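For step 1, the log in the question shows "Cannot open channel to 1 at election address ... connect timed out", so one quick thing to rule out is plain TCP reachability of each peer's election port. The following is only a minimal sketch: the function names `can_connect` and `check_peers` are made up for illustration, and the peer addresses you pass in are whatever your own server.N lines contain.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_peers(peers, timeout=5.0):
    """peers maps server id -> (host, election_port); returns id -> reachable."""
    return {
        sid: can_connect(host, port, timeout)
        for sid, (host, port) in peers.items()
    }
```

For example, `check_peers({1: ("54.76.10.81", 3888)})` would report whether the address from the log accepts connections from this node. A `False` here only says that the TCP connect failed; it does not say why (peer not started yet, firewall or security group rule, wrong address).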

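I have used clusterssh to open a window on each server so that the restarts can happen at the same time... and then I tail all the server logs. Keep in mind that during a restart the ZK cluster is doing a lot: starting each server and electing a leader. There have been times when the cluster seemed to have failed, and after a few minutes it appeared to sort itself out.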

There is a tool called zktop that I use to monitor ZK.
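If zktop is not at hand, a rough check can be done by sending one of ZooKeeper's four-letter commands (such as `ruok` or `stat`) to the client port and reading the reply. This is a minimal sketch, not a replacement for a monitoring tool; the helper name `four_letter_word` is made up for illustration. Newer ZooKeeper releases only answer commands listed in `4lw.commands.whitelist`, so an empty reply does not necessarily mean the server is down.

```python
import socket


def four_letter_word(host, port, command="ruok", timeout=5.0):
    """Send a four-letter command to host:port and return the raw reply as text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.settimeout(timeout)
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```

Calling `four_letter_word("127.0.0.1", 2181, "ruok")` against the clientPort from the question's zoo.cfg should return `imok` from a healthy server; connection errors and timeouts surface as `OSError`.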

Answer 1 (score: 0)

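I fixed it by changing the IP 127.0.0.1 to the internal IP of the Amazon node. After making this change on all three nodes and restarting, the problem has not happened again. I hope this answer helps anyone who runs into the same problem.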
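To spot this kind of entry across several nodes, the server.N lines of each zoo.cfg can be scanned for loopback addresses. The sketch below is illustrative only: `parse_servers` and `loopback_servers` are hypothetical helper names, the parser handles just the `server.N=host:port:port` form shown in the question (no IPv6 literals), and hostnames other than `localhost` are not resolved.

```python
import ipaddress


def parse_servers(cfg_text):
    """Return {server_id: (host, quorum_port, election_port)} from zoo.cfg text."""
    servers = {}
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if not line.startswith("server.") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        parts = value.strip().split(":")
        if len(parts) < 3:
            continue  # not in host:port:port form
        sid = int(key.strip()[len("server."):])
        servers[sid] = (parts[0].strip(), int(parts[1]), int(parts[2]))
    return servers


def is_loopback(host):
    """True for 'localhost' or a literal loopback IP such as 127.0.0.1."""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # a hostname; this sketch does not resolve it


def loopback_servers(cfg_text):
    """Return the sorted ids of server.N entries that point at loopback."""
    servers = parse_servers(cfg_text)
    return sorted(sid for sid, (host, _, _) in servers.items() if is_loopback(host))
```

Run against a config like the one in the question (with real addresses in place of the IP1 and IP2 placeholders), `loopback_servers` would return `[3]`, the entry this answer replaced with the node's internal IP.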

Answer 2 (score: 0)

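Make sure the correct dataDir is set in each node's configuration. Also place a myid file in the data directory, and put a number between 1 and 255 in each node's myid file. I think that solves the problem.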
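A check along these lines can be scripted. The following is a minimal sketch under stated assumptions: `check_myid` is a hypothetical helper, `data_dir` is the dataDir value from that node's zoo.cfg, and `server_ids` is the set of N values from its server.N lines. It does not check file permissions.

```python
from pathlib import Path


def check_myid(data_dir, server_ids):
    """Return a list of problems found with data_dir/myid (empty list if none)."""
    path = Path(data_dir) / "myid"
    if not path.is_file():
        return [f"{path} does not exist"]
    text = path.read_text().strip()
    if not (text.isascii() and text.isdigit()):
        return [f"{path} does not contain a plain number: {text!r}"]
    myid = int(text)
    problems = []
    if not 1 <= myid <= 255:
        problems.append(f"myid {myid} is outside the range 1-255")
    if myid not in server_ids:
        problems.append(f"myid {myid} has no matching server.{myid} entry")
    return problems
```

For the node in the question, `check_myid("/var/lib/zookeeperdata/3", {1, 2, 3})` should return an empty list if the myid file contains 3, which matches the `[myid:3]` shown in its log.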