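We are running a 2-node CrateDB cluster (v2.3.4). It ran without any problems for more than a month. Recently we noticed that one node went down without any external interference, and we have not been able to find the root cause of this event.
The logs are below. Please help.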
Apr 12 23:47:04 STATS-DB-M crate[162556]: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_131]
Apr 12 23:47:04 STATS-DB-M crate[162556]: at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Apr 12 23:47:04 STATS-DB-M crate[162556]: [2018-04-12T23:47:04,027][WARN ][o.e.c.a.s.ShardStateAction] [crate3] [online_dlr_report_cache_20180412][7] received shard failed for shard id [[online_dlr_report_cache_20180412][7]], allocation id [NahsM0yfRPaHA5waOpu5OA], primary term [2], message [mark copy as stale]
Apr 12 23:47:04 STATS-DB-M crate[162556]: [2018-04-12T23:47:04,027][WARN ][o.e.c.a.s.ShardStateAction] [crate3] [online_dlr_report_cache_20180412][1] received shard failed for shard id [[online_dlr_report_cache_20180412][1]], allocation id [haMsWkQGTe-yTIfGSkLbHw], primary term [2], message [mark copy as stale]
Apr 12 23:47:04 STATS-DB-M crate[162556]: [2018-04-12T23:47:04,026][WARN ][o.e.c.a.s.ShardStateAction] [crate3] [online_dlr_report_cache_20180412][1] received shard failed for shard id [[online_dlr_report_cache_20180412][1]], allocation id [ZfHGc1DiTZmJ2JQ3YoA_Yg], primary term [1], message [failed to perform indices:crate/data/write/upsert on replica [online_dlr_report_cache_20180412][1], node[1RRQy42EQ8meT7S40loaEw], [R], s[STARTED], a[id=ZfHGc1DiTZmJ2JQ3YoA_Yg]], failure [RemoteTransportException[[crate3][192.168.1.50:4300][indices:crate/data/write/upsert[r]]]; nested: IllegalStateException[active primary shard cannot be a replication target before relocation hand off [online_dlr_report_cache_20180412][1], node[1RRQy42EQ8meT7S40loaEw], [P], s[STARTED], a[id=ZfHGc1DiTZmJ2JQ3YoA_Yg], state is [STARTED]]; ]
Apr 12 23:47:04 STATS-DB-M crate[162556]: org.elasticsearch.transport.RemoteTransportException: [crate3][192.168.1.50:4300][indices:crate/data/write/upsert[r]]
Apr 12 23:47:04 STATS-DB-M crate[162556]: Caused by: java.lang.IllegalStateException: active primary shard cannot be a replication target before relocation hand off [online_dlr_report_cache_20180412][1], node[1RRQy42EQ8meT7S40loaEw], [P], s[STARTED], a[id=ZfHGc1DiTZmJ2JQ3YoA_Yg], state is [STARTED]
Apr 12 23:47:10 STATS-DB-M systemd[1]: crate.service: main process exited, code=exited, status=126/n/a
Apr 12 23:47:10 STATS-DB-M systemd[1]: Unit crate.service entered failed state.
Apr 12 23:47:10 STATS-DB-M systemd[1]: crate.service failed.
Answer (score: 0)
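The logs don't indicate why the node failed. Do you have any other information? We generally recommend a cluster of at least 3 nodes, so that a quorum can still be reached when one node fails. If you have more information, please let us know. Thanks, Joe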
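The quorum argument comes down to simple arithmetic. The sketch below assumes the usual majority rule, floor(n / 2) + 1 master-eligible nodes, which is the value commonly recommended for `discovery.zen.minimum_master_nodes` in Zen-discovery-based versions such as CrateDB 2.x. The function names are illustrative only and are not part of any CrateDB API.

```python
def quorum(master_eligible_nodes: int) -> int:
    """Majority of master-eligible nodes, assuming the floor(n / 2) + 1 rule."""
    return master_eligible_nodes // 2 + 1


def survives_one_failure(master_eligible_nodes: int) -> bool:
    """True if the remaining nodes still form a majority after one node is lost."""
    return master_eligible_nodes - 1 >= quorum(master_eligible_nodes)


if __name__ == "__main__":
    # Compare a 2-node cluster with 3- and 5-node clusters.
    for n in (2, 3, 5):
        print(f"{n} nodes: quorum={quorum(n)}, survives one failure={survives_one_failure(n)}")
```

With 2 nodes the quorum is 2, so losing either node leaves the survivor unable to form a majority. With 3 nodes the quorum is still 2, so the cluster can tolerate one failure.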