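I index about 7 billion documents per day into my SolrCloud, which has 10 instances each running with Xmx and Xms set to 5 GB. The documents are pushed to a collection named "X". Its schema has more than 150 fields, almost all of which are indexed. Collection X has 240 shards, each with a replication factor of 2.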
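To put these numbers in perspective, here is a back-of-the-envelope calculation of the load per instance. It is only a sketch: it assumes the replicas are spread evenly across the 10 instances, which I have not verified, and the variable names are purely illustrative.

```python
# Rough sizing from the figures above. Assumes replicas are spread evenly
# across instances (an assumption; actual placement may differ).
shards = 240
replication_factor = 2
instances = 10
heap_gb_per_instance = 5

total_cores = shards * replication_factor        # every replica is one Solr core
cores_per_instance = total_cores / instances
heap_mb_per_core = heap_gb_per_instance * 1024 / cores_per_instance

print(f"total cores: {total_cores}")
print(f"cores per instance: {cores_per_instance:.0f}")
print(f"heap per core: {heap_mb_per_core:.0f} MB")
```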
The problem I am currently facing is that, out of the 240 shards, 3 to 4 go down at random with the following exception:
org.apache.solr.common.SolrException: No registered leader was found after waiting for 4000ms , collection: X slice: shard118
at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:747)
at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:733)
at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:305)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:221)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
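To see which shards are affected at a given moment, the output of the Collections API `CLUSTERSTATUS` action (`/solr/admin/collections?action=CLUSTERSTATUS&collection=X&wt=json`) can be scanned for shards that have no active leader. The sketch below assumes the usual JSON layout of that response; the helper name `shards_without_leader` and the sample data are hypothetical, not real output from my cluster.

```python
def shards_without_leader(cluster_status, collection):
    """Return the sorted names of shards with no active replica flagged as leader."""
    shards = cluster_status["cluster"]["collections"][collection]["shards"]
    leaderless = []
    for shard_name, shard in shards.items():
        has_leader = any(
            # Solr reports the leader flag as the string "true".
            str(replica.get("leader", "false")).lower() == "true"
            and replica.get("state") == "active"
            for replica in shard.get("replicas", {}).values()
        )
        if not has_leader:
            leaderless.append(shard_name)
    return sorted(leaderless)


# Hand-written sample in the shape of a CLUSTERSTATUS response (not real output).
sample = {
    "cluster": {
        "collections": {
            "X": {
                "shards": {
                    "shard63": {
                        "replicas": {
                            "core_node57": {"state": "active", "leader": "true"},
                            "core_node58": {"state": "active"},
                        }
                    },
                    "shard118": {
                        "replicas": {
                            "core_node201": {"state": "down"},
                            "core_node202": {"state": "recovering"},
                        }
                    },
                }
            }
        }
    }
}

print(shards_without_leader(sample, "X"))
```

In practice the dictionary would come from parsing the JSON returned by a Solr node rather than from a literal.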
We then found another exception in the Solr logs:
ERROR (zkCallback-4-thread-4-processing-n:<IP>:8983_solr) [c:X s:shard63 r:core_node57 x:X_shard63_replica1] o.a.s.c.Overseer Could not create Overseer node
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer
at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:391)
at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:388)
at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
at org.apache.solr.common.cloud.SolrZkClient.create(SolrZkClient.java:388)
at org.apache.solr.cloud.Overseer.createOverseerNode(Overseer.java:731)
at org.apache.solr.cloud.Overseer.getStateUpdateQueue(Overseer.java:604)
at org.apache.solr.cloud.Overseer.getStateUpdateQueue(Overseer.java:591)
at org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:314)
at org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:170)
at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:135)
at org.apache.solr.cloud.LeaderElector.access$200(LeaderElector.java:56)
at org.apache.solr.cloud.LeaderElector$ElectionWatcher.process(LeaderElector.java:348)
at org.apache.solr.common.cloud.SolrZkClient$3.lambda$process$0(SolrZkClient.java:268)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
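As a workaround, I deleted all the replicas of the shards that were down and re-created them. This fixes the problem, but it only works intermittently.
It also carries the risk of losing a large amount of data if no in-sync replica is found.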
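For reference, the workaround boils down to calls to the Collections API `DELETEREPLICA` and `ADDREPLICA` actions. The sketch below only builds the request URLs and sends nothing; the base URL and the helper names are hypothetical, and the shard and replica names are taken from the log line above purely as an illustration.

```python
from urllib.parse import urlencode

# Hypothetical base URL; replace with the address of a real Solr node.
SOLR_BASE = "http://localhost:8983/solr"


def delete_replica_url(collection, shard, replica):
    """Build the Collections API URL that removes one replica of a shard."""
    params = {
        "action": "DELETEREPLICA",
        "collection": collection,
        "shard": shard,
        "replica": replica,
    }
    return f"{SOLR_BASE}/admin/collections?{urlencode(params)}"


def add_replica_url(collection, shard, node=None):
    """Build the Collections API URL that adds a replica to a shard.

    If node is omitted, Solr chooses where to place the new replica.
    """
    params = {"action": "ADDREPLICA", "collection": collection, "shard": shard}
    if node:
        params["node"] = node
    return f"{SOLR_BASE}/admin/collections?{urlencode(params)}"


print(delete_replica_url("X", "shard63", "core_node57"))
print(add_replica_url("X", "shard63"))
```

Removing a replica that holds the only up-to-date copy of a shard is exactly where the data-loss risk comes from, so the delete call is the one to double-check before running it.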
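Can anyone suggest a better way to solve this? It is happening in production, so I cannot re-create the collection (that does fix the issue, but the same problem may reappear after a while), and I cannot restart ZooKeeper either, because many other Spark jobs depend on the same ensemble.
I have been stuck on this for a long time.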
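Update:
We are not performing any operation on SolrCloud that could cause the shards to go down. The only activity is Spark batch jobs that run on top of this collection to process data. These jobs run twice a day, but the shards do not go down during those runs.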