MongoDB SDK failover does not work properly

Date: 2014-12-01 18:46:27

Tags: java mongodb high-availability mongodb-java failover

I have set up a replica set on three machines (192.168.122.21, 192.168.122.147 and 192.168.122.148), and I am using the Java SDK to interact with the MongoDB cluster:

ArrayList<ServerAddress> addrs = new ArrayList<ServerAddress>();
addrs.add(new ServerAddress("192.168.122.21", 27017));
addrs.add(new ServerAddress("192.168.122.147", 27017));
addrs.add(new ServerAddress("192.168.122.148", 27017));
this.mongoClient = new MongoClient(addrs);
this.db = this.mongoClient.getDB(this.db_name);
this.collection = this.db.getCollection(this.collection_name);

After the connection is established, I insert a simple test document many times:

    for (int i = 0; i < this.inserts; i++) {
        try {
            this.collection.insert(new BasicDBObject(String.valueOf(i), "test"));
        } catch (Exception e) {
            System.out.println("Error on inserting element: " + i);
            e.printStackTrace();
        }
    }

When I simulate a crash of the primary node (power loss), the MongoDB cluster fails over successfully:

       19:08:03.907+0100 [rsHealthPoll] replSet info 192.168.122.21:27017 is down (or slow to respond): 
       19:08:03.907+0100 [rsHealthPoll] replSet member 192.168.122.21:27017 is now in state DOWN
       19:08:04.153+0100 [rsMgr] replSet info electSelf 1
       19:08:04.154+0100 [rsMgr] replSet couldn't elect self, only received -9999 votes
       19:08:05.648+0100 [conn15] replSet info voting yea for 192.168.122.148:27017 (2)
       19:08:10.681+0100 [rsMgr] replSet not trying to elect self as responded yea to someone else recently
       19:08:10.910+0100 [rsHealthPoll] replset info 192.168.122.21:27017 heartbeat failed, retrying
       19:08:16.394+0100 [rsMgr] replSet not trying to elect self as responded yea to someone else recently
       19:08:22.876+.
       19:08:22.912+0100 [rsHealthPoll] replset info 192.168.122.21:27017 heartbeat failed, retrying
       19:08:23.623+0100 [SyncSourceFeedbackThread] replset setting syncSourceFeedback to 192.168.122.148:27017
       19:08:23.917+0100 [rsHealthPoll] replSet member 192.168.122.148:27017 is now in state PRIMARY

The MongoDB driver on the client side also recognizes this:

       Dec 01, 2014 7:08:16 PM com.mongodb.ConnectionStatus$UpdatableNode update
       WARNING: Server seen down: /192.168.122.21:27017 - java.io.IOException - message: Read timed out
       WARNING: Server seen down: /192.168.122.21:27017 - java.io.IOException - message: couldn't connect to [/192.168.122.21:27017]  bc:java.net.SocketTimeoutException: connect timed out
       Dec 01, 2014 7:08:36 PM com.mongodb.DBTCPConnector setMasterAddress
       WARNING: Primary switching from /192.168.122.21:27017 to /192.168.122.148:27017

But it still keeps trying to connect to the old node (forever):

       Dec 01, 2014 7:08:50 PM com.mongodb.ConnectionStatus$UpdatableNode update
       WARNING: Server seen down: /192.168.122.21:27017 - java.io.IOException - message: couldn't connect to [/192.168.122.21:27017] bc:java.net.NoRouteToHostException: No route to host
       .....
       Dec 01, 2014 7:10:43 PM com.mongodb.ConnectionStatus$UpdatableNode update
       WARNING: Server seen down: /192.168.122.21:27017 - java.io.IOException -message: couldn't connect to [/192.168.122.21:27017] bc:java.net.NoRouteToHostException: No route to host

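The document count in the database stays the same from the moment the primary fails and a secondary becomes primary. Here is the output from the same node during the process: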

  

    rs0:SECONDARY> db.test_collection.find().count()
    12260161

    rs0:PRIMARY> db.test_collection.find().count()
    12260161

Update: With WriteConcern Unacknowledged it works as designed. Inserts are also performed on the new primary, and all operations issued during the election are lost.

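With WriteConcern Acknowledged, the operation seems to wait indefinitely for an ACK from the crashed primary. That would explain why the program continues once the crashed server boots up again and rejoins the cluster as a secondary. In my case, though, I do not want the driver to wait forever; it should raise an error after some time.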

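Update: WriteConcern Acknowledged also works as expected when I kill the mongod process on the primary node. In that case the failover takes only about 3 seconds. No inserts are performed during that time, and once the new primary has been elected the inserts continue.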

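So I only run into the problem when simulating a node failure (power loss/network failure). In that case the operation hangs until the failed node is up again.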
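One client-side way to bound such a wait, independent of any driver setting, is to run the blocking call on a worker thread and stop waiting after a deadline. The following is only a sketch of that general technique using java.util.concurrent, not something the driver provides: `BoundedCall` and `callWithTimeout` are hypothetical names, and the `Thread.sleep` call stands in for an insert that never receives its ACK.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper for illustration; not part of the MongoDB driver. */
public class BoundedCall {

    /**
     * Runs the task on a worker thread and waits at most timeoutMillis for it.
     * Throws TimeoutException if the task has not finished by then.
     */
    public static <T> T callWithTimeout(Callable<T> task, long timeoutMillis)
            throws InterruptedException, ExecutionException, TimeoutException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } finally {
            // Request cancellation of a still-running task and release the executor.
            future.cancel(true);
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // A fast call returns its result normally.
        String fast = callWithTimeout(() -> "ok", 1000);
        System.out.println(fast);

        // A blocking call (stand-in for a hanging insert) is abandoned after 200 ms.
        try {
            callWithTimeout(() -> { Thread.sleep(5000); return "late"; }, 200);
        } catch (TimeoutException e) {
            System.out.println("timed out");
        }
    }
}
```

Note that this only bounds how long the caller waits. Interrupting a thread does not generally unblock a plain blocking socket read in Java, so the worker may stay stuck until the socket itself gives up.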

2 Answers:

Answer 0: (score: 0)

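Does your application still work? Since that server is still in your seed list, the driver will keep trying to connect to it, as far as I know. As long as any other server in the seed list can become primary, your application should still work.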

Answer 1: (score: 0)

Explicitly specifying connection timeout values solved the problem. See also: http://api.mongodb.org/java/2.7.0/com/mongodb/MongoOptions.html
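The answer does not say which values were set, so the following is only a configuration sketch under assumptions. It uses `MongoClientOptions`, the options class that goes with the `MongoClient` used in the question, rather than the older `MongoOptions` class the link points to. The class name `TimeoutConfigSketch` and the timeout values are illustrative, and the code needs the MongoDB Java driver on the classpath.

```java
import com.mongodb.MongoClient;
import com.mongodb.MongoClientOptions;
import com.mongodb.ServerAddress;

import java.net.UnknownHostException;
import java.util.Arrays;
import java.util.List;

public class TimeoutConfigSketch {
    public static void main(String[] args) throws UnknownHostException {
        // Seed list taken from the question.
        List<ServerAddress> seeds = Arrays.asList(
                new ServerAddress("192.168.122.21", 27017),
                new ServerAddress("192.168.122.147", 27017),
                new ServerAddress("192.168.122.148", 27017));

        // Timeout values are assumptions for illustration, in milliseconds.
        MongoClientOptions options = MongoClientOptions.builder()
                .connectTimeout(5000)   // stop trying to open a connection after 5 s
                .socketTimeout(10000)   // stop waiting on a blocked socket read/write after 10 s
                .build();

        MongoClient mongoClient = new MongoClient(seeds, options);
        try {
            // Use mongoClient.getDB(...) and insert as in the question.
        } finally {
            mongoClient.close();
        }
    }
}
```

With a socket timeout in place, an acknowledged insert that is waiting on a host that vanished without closing the connection should fail with an exception once the timeout expires, which the `catch` block in the question's insert loop would then report. The value has to be longer than the slowest legitimate operation, since it applies to every read on the socket.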