Question

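I have been working on a process that continuously monitors distributed atomic long counters. It checks each counter every minute using the getCounter method of the class below. In fact, I have multiple threads running, each monitoring a different counter (a DistributedAtomicLong) stored in a Zookeeper node. Each thread specifies the path of its counter through the parameters of the getCounter method.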

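import java.util.Properties;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.atomic.DistributedAtomicLong;
import org.apache.curator.framework.recipes.atomic.PromotedToLock;
import org.apache.curator.framework.recipes.atomic.PromotedToLock.Builder;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.curator.retry.RetryNTimes;

public class TagserterZookeeperManager {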

public enum ZkClient {
    COUNTER("10.11.18.25:2181");  // Integration URL

    private CuratorFramework client;
    private ZkClient(String servers) {
        Properties props = TagserterConfigs.ZOOKEEPER.getProperties();
        String zkFromConfig = props.getProperty("servers", "");
        if (zkFromConfig != null && !zkFromConfig.isEmpty()) {
            servers = zkFromConfig.trim();
        }
        ExponentialBackoffRetry exponentialBackoffRetry = new ExponentialBackoffRetry(1000, 3);
        client = CuratorFrameworkFactory.newClient(servers, exponentialBackoffRetry);
        client.start();
    }

    public CuratorFramework getClient() {
        return client;
    }
}

public static String buildPath(String ... node) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < node.length; i++) {
        if (node[i] != null && !node[i].isEmpty()) {
            sb.append("/");
            sb.append(node[i]);
        }
    }
    return sb.toString();
}

public static DistributedAtomicLong getCounter(String taskType, int hid, String jobId, String countType) {
    String path = buildPath(taskType, hid+"", jobId, countType);
    Builder builder = PromotedToLock.builder().lockPath(path + "/lock").retryPolicy(new ExponentialBackoffRetry(10, 10));
    DistributedAtomicLong count = new DistributedAtomicLong(ZkClient.COUNTER.getClient(), path, new RetryNTimes(5, 20), builder.build());
    return count;
}

}

In a thread, this is how I call the method:

    DistributedAtomicLong counterTotal = TagserterZookeeperManager
                        .getCounter("testTopic", hid, jobId, "test");

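Now, after the threads have been running for a few hours, at some point I start getting the following org.apache.zookeeper.KeeperException$ConnectionLossException inside the getCounter method, where it tries to read the count: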

org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /contentTaskProd
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
    at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1073)
    at org.apache.curator.utils.ZKPaths.mkdirs(ZKPaths.java:215)
    at org.apache.curator.utils.EnsurePath$InitialHelper$1.call(EnsurePath.java:148)
    at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107)
    at org.apache.curator.utils.EnsurePath$InitialHelper.ensure(EnsurePath.java:141)
    at org.apache.curator.utils.EnsurePath.ensure(EnsurePath.java:99)
    at org.apache.curator.framework.recipes.atomic.DistributedAtomicValue.getCurrentValue(DistributedAtomicValue.java:254)
    at org.apache.curator.framework.recipes.atomic.DistributedAtomicValue.get(DistributedAtomicValue.java:91)
    at org.apache.curator.framework.recipes.atomic.DistributedAtomicLong.get(DistributedAtomicLong.java:72)
    ...

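I keep getting this exception from that point on, and after a while I have the feeling that it causes some internal memory leak, which eventually leads to an OutOfMemory error and the whole process fails. Does anybody have any idea what the reason could be? Why would Zookeeper suddenly start throwing connection loss exceptions? After the process exits, I can connect to Zookeeper manually through another small console program I wrote (also using Curator), and everything looks fine there.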

Answer 1

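To monitor a node in Zookeeper with Curator, you can use NodeCache. This won't solve your connection problems, but instead of polling the node once a minute you get a push event when it changes.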

In my experience, NodeCache copes quite well with disconnects and reconnects.
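
As a rough illustration of the NodeCache suggestion, here is a minimal sketch. The class name CounterWatcher and its two methods are hypothetical and not part of the question's code. It also assumes the counter node holds the 8-byte big-endian long that DistributedAtomicLong writes, and that the Curator version in use still ships NodeCache (newer releases favor CuratorCache).

```java
import java.nio.ByteBuffer;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.cache.ChildData;
import org.apache.curator.framework.recipes.cache.NodeCache;
import org.apache.curator.framework.recipes.cache.NodeCacheListener;

// Hypothetical helper: watches one counter node instead of polling it.
public class CounterWatcher {

    // Assumption: the payload is an 8-byte big-endian long,
    // which is how DistributedAtomicLong serializes its value.
    public static long decodeLong(byte[] data) {
        return ByteBuffer.wrap(data).getLong();
    }

    // Starts a NodeCache on the given path and logs every change.
    // The caller is responsible for calling close() on the returned cache.
    public static NodeCache watch(CuratorFramework client, final String counterPath) throws Exception {
        final NodeCache cache = new NodeCache(client, counterPath);
        cache.getListenable().addListener(new NodeCacheListener() {
            @Override
            public void nodeChanged() throws Exception {
                ChildData current = cache.getCurrentData();
                // getCurrentData() is null when the node does not exist (or was deleted).
                if (current == null || current.getData() == null || current.getData().length < 8) {
                    return;
                }
                System.out.println(counterPath + " = " + decodeLong(current.getData()));
            }
        });
        cache.start();
        return cache;
    }
}
```

With the question's code, each thread could call CounterWatcher.watch(ZkClient.COUNTER.getClient(), path) once, where path is the same string that buildPath produces inside getCounter, and close the returned cache when it stops monitoring.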

Apache Curator - Zookeeper connection loss exception, possible memory leak
