Solr shards go down suddenly

Date: 2017-11-30 16:08:41

Tags: apache hadoop solr

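I index roughly 7 billion documents per day into my SolrCloud cluster. It has 10 instances, each running with Xmx and Xms set to 5 GB, and the documents are pushed into a collection named "X". Its schema has more than 150 fields, almost all of which are indexed. Collection X has 240 shards, each with a replication factor of 2.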
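
For scale, the figures above can be turned into per-instance and per-shard numbers. This is only back-of-the-envelope arithmetic. It assumes the 5 GB heap applies to each of the 10 instances, that cores are spread evenly across them, and that ingest is uniform over 24 hours; none of these is confirmed here.

```python
# Back-of-the-envelope sizing from the figures stated in the question.
DOCS_PER_DAY = 7_000_000_000
SHARDS = 240
REPLICATION_FACTOR = 2
SOLR_INSTANCES = 10
HEAP_GB_PER_INSTANCE = 5  # assumption: Xmx/Xms is per instance

# Assumption: replicas (cores) are distributed evenly across instances.
total_cores = SHARDS * REPLICATION_FACTOR
cores_per_instance = total_cores / SOLR_INSTANCES
heap_mb_per_core = HEAP_GB_PER_INSTANCE * 1024 / cores_per_instance

# Assumption: documents arrive at a uniform rate over the whole day.
docs_per_second = DOCS_PER_DAY / 86_400
docs_per_shard_per_day = DOCS_PER_DAY / SHARDS

print(f"total cores:            {total_cores}")
print(f"cores per instance:     {cores_per_instance:.0f}")
print(f"heap per core (MB):     {heap_mb_per_core:.0f}")
print(f"docs per second:        {docs_per_second:,.0f}")
print(f"docs per shard per day: {docs_per_shard_per_day:,.0f}")
```

Under these assumptions each instance hosts 48 cores that share a single 5 GB heap.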

The problem I am currently facing is that, out of the 240 shards, 3 to 4 go down at random with the following exception:

org.apache.solr.common.SolrException: No registered leader was found after waiting for 4000ms , collection: X slice: shard118
    at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:747)
    at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:733)
    at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:305)
    at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:221)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Right after that, we found another exception in the Solr logs:

ERROR (zkCallback-4-thread-4-processing-n:<IP>:8983_solr) [c:X s:shard63 r:core_node57 x:X_shard63_replica1] o.a.s.c.Overseer Could not create Overseer node
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
    at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:391)
    at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:388)
    at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
    at org.apache.solr.common.cloud.SolrZkClient.create(SolrZkClient.java:388)
    at org.apache.solr.cloud.Overseer.createOverseerNode(Overseer.java:731)
    at org.apache.solr.cloud.Overseer.getStateUpdateQueue(Overseer.java:604)
    at org.apache.solr.cloud.Overseer.getStateUpdateQueue(Overseer.java:591)
    at org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:314)
    at org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:170)
    at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:135)
    at org.apache.solr.cloud.LeaderElector.access$200(LeaderElector.java:56)
    at org.apache.solr.cloud.LeaderElector$ElectionWatcher.process(LeaderElector.java:348)
    at org.apache.solr.common.cloud.SolrZkClient$3.lambda$process$0(SolrZkClient.java:268)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
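
To see whether the same shards keep failing, the two log formats shown above can be tallied per shard. The following is a minimal sketch that relies only on the two message shapes quoted in this post; the function name `affected_shards` and the sample list are illustrative, and real log lines may carry extra prefixes such as timestamps.

```python
import re
from collections import Counter

# Matches the bracketed logging context seen in the ERROR line above, e.g.
# [c:X s:shard63 r:core_node57 x:X_shard63_replica1]
MDC_PATTERN = re.compile(
    r"\[c:(?P<collection>\S+) s:(?P<shard>\S+) r:\S+ x:\S+\]"
)
# Matches the "collection: X slice: shard118" tail of the no-leader message.
NO_LEADER_PATTERN = re.compile(
    r"No registered leader was found.*collection: (?P<collection>\S+) "
    r"slice: (?P<shard>\S+)"
)


def affected_shards(log_lines):
    """Count no-leader messages and ERROR lines per (collection, shard)."""
    counts = Counter()
    for line in log_lines:
        match = NO_LEADER_PATTERN.search(line)
        if match is None and "ERROR" in line:
            match = MDC_PATTERN.search(line)
        if match is not None:
            counts[(match.group("collection"), match.group("shard"))] += 1
    return counts


# The two header lines from the traces quoted in this post.
SAMPLE_LINES = [
    "org.apache.solr.common.SolrException: No registered leader was found "
    "after waiting for 4000ms , collection: X slice: shard118",
    "ERROR (zkCallback-4-thread-4-processing-n:<IP>:8983_solr) "
    "[c:X s:shard63 r:core_node57 x:X_shard63_replica1] "
    "o.a.s.c.Overseer Could not create Overseer node",
]

counts = affected_shards(SAMPLE_LINES)
for (collection, shard), n in sorted(counts.items()):
    print(collection, shard, n)
```

Running the same function over a full day of logs and comparing the timestamps with the Spark job schedule would show whether the failures cluster around particular shards or times.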

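As a workaround, I deleted all replicas of the shards that were down and recreated them. This fixes the problem, but it only works intermittently.

It also risks losing a large amount of data if no in-sync replica is found.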
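
A more conservative variant of this workaround can be sketched as follows. It is not the procedure I actually ran: it only plans Collections API calls (`DELETEREPLICA`, `ADDREPLICA`) for replicas that are not active, and it skips any shard that has no active replica on a live node, since deleting replicas there is where the data-loss risk lies. The input is assumed to be the parsed JSON of a `CLUSTERSTATUS` response; `SOLR_BASE`, the host names, `core_node58`, and the function names are hypothetical. The sketch builds URLs only and sends nothing.

```python
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"  # hypothetical host and port


def collections_api_url(action, **params):
    """Build a Collections API URL; nothing is sent over the network."""
    query = urlencode({"action": action, "wt": "json", **params})
    return f"{SOLR_BASE}/admin/collections?{query}"


def plan_replica_replacement(cluster_status, collection):
    """Return (urls, skipped_shards) for replicas that are not active.

    A shard is skipped when it has no active replica on a live node,
    because removing its replicas could discard the only copy of the data.
    """
    cluster = cluster_status["cluster"]
    live_nodes = set(cluster["live_nodes"])
    shards = cluster["collections"][collection]["shards"]
    urls, skipped = [], []
    for shard_name, shard in shards.items():
        replicas = shard["replicas"]
        healthy = [
            name for name, r in replicas.items()
            if r["state"] == "active" and r["node_name"] in live_nodes
        ]
        unhealthy = [name for name in replicas if name not in healthy]
        if not unhealthy:
            continue
        if not healthy:
            skipped.append(shard_name)
            continue
        for name in unhealthy:
            urls.append(collections_api_url(
                "DELETEREPLICA", collection=collection,
                shard=shard_name, replica=name))
            urls.append(collections_api_url(
                "ADDREPLICA", collection=collection, shard=shard_name))
    return urls, skipped


# Hand-written sample in the shape of a CLUSTERSTATUS response.
sample_status = {"cluster": {
    "live_nodes": ["host1:8983_solr"],
    "collections": {"X": {"shards": {
        "shard118": {"replicas": {
            "core_node1": {"state": "active", "node_name": "host1:8983_solr"},
            "core_node2": {"state": "down", "node_name": "host2:8983_solr"},
        }},
        "shard63": {"replicas": {
            "core_node57": {"state": "down", "node_name": "host2:8983_solr"},
            "core_node58": {"state": "down", "node_name": "host2:8983_solr"},
        }},
    }}},
}}

urls, skipped = plan_replica_replacement(sample_status, "X")
for url in urls:
    print(url)
print("skipped (no healthy replica):", skipped)
```

Shards that end up in the skipped list would need a different recovery path, because this plan deliberately leaves their replicas untouched.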

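Can anyone suggest a better way to solve this? It is happening in production, so I cannot recreate the collection (that fixes the issue, but the same problem may reappear after some time). I also cannot restart ZooKeeper, because many other Spark jobs depend on the same ensemble.

I have been stuck on this for a long time.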

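Update:

We are not performing any operation on SolrCloud that could be causing the shards to go down. The only activity is a Spark batch job that runs on top of this collection to process the data. The Spark batch job runs twice a day, but the shards do not go down during those runs.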

0 answers:

No answers