Cassandra bootstrap hangs in the JOINING state for a long time

Date: 2015-03-30 06:51:47

Tags: cassandra

Our production environment has two Cassandra nodes (v2.0.5), and we want to add more nodes to scale out. We followed the steps described in the DataStax documentation.

After bootstrapping the new node, we saw exceptions like this in the log:

ERROR [CompactionExecutor:42] 2015-03-25 19:01:01,821 CassandraDaemon.java (line 192) Exception in thread Thread[CompactionExecutor:42,1,main]
java.lang.RuntimeException: org.apache.cassandra.db.filter.TombstoneOverwhelmingException
at org.apache.cassandra.service.pager.QueryPagers$1.next(QueryPagers.java:154)
at org.apache.cassandra.service.pager.QueryPagers$1.next(QueryPagers.java:137)

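The node keeps repeating compaction tasks. After two weeks it still has not finished bootstrapping, and it is still in the JOINING state: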

INFO [CompactionExecutor:4468] 2015-03-30 09:18:20,288 ColumnFamilyStore.java (line 784) Enqueuing flush of Memtable-compactions_in_progress@1247174540(212/13568 serialized/live bytes, 7 ops)
INFO [FlushWriter:314] 2015-03-30 09:18:22,408 Memtable.java (line 373) Completed flushing /var/lib/cassandra/data/production_alarm_keyspace/alarm_history_data_new/production_alarm_keyspace-alarm_history_data_new-jb-118-Data.db (11216962 bytes) for commitlog position ReplayPosition(segmentId=1427280544702, position=24550137)
INFO [FlushWriter:314] 2015-03-30 09:18:22,409 Memtable.java (line 333) Writing Memtable-alarm_master_data@37361826(26718076/141982437 serialized/live bytes, 791595 ops)
INFO [FlushWriter:314] 2015-03-30 09:18:24,018 Memtable.java (line 373) Completed flushing /var/lib/cassandra/data/production_alarm_keyspace/alarm_master_data/production_alarm_keyspace-alarm_master_data-jb-346-Data.db (8407637 bytes) for commitlog position ReplayPosition(segmentId=1427280544702, position=24550137)
INFO [FlushWriter:314] 2015-03-30 09:18:24,018 Memtable.java (line 333) Writing Memtable-compactions_in_progress@1247174540(212/13568 serialized/live bytes, 7 ops)
INFO [FlushWriter:314] 2015-03-30 09:18:24,185 Memtable.java (line 373) Completed flushing /var/lib/cassandra/data/system/compactions_in_progress/system-compactions_in_progress-jb-1019-Data.db (201 bytes) for commitlog position ReplayPosition(segmentId=1427280544702, position=24550511)
INFO [CompactionExecutor:4468] 2015-03-30 09:18:24,186 CompactionTask.java (line 115) Compacting [SSTableReader(path='/var/lib/cassandra/data/production_alarm_keyspace/alarm_common_dump_by_minutes/production_alarm_keyspace-alarm_common_dump_by_minutes-jb-356-Data.db'), SSTableReader(path='/var/lib/cassandra/data/production_alarm_keyspace/alarm_common_dump_by_minutes/production_alarm_keyspace-alarm_common_dump_by_minutes-jb-357-Data.db'), SSTableReader(path='/var/lib/cassandra/data/production_alarm_keyspace/alarm_common_dump_by_minutes/production_alarm_keyspace-alarm_common_dump_by_minutes-jb-355-Data.db'), SSTableReader(path='/var/lib/cassandra/data/production_alarm_keyspace/alarm_common_dump_by_minutes/production_alarm_keyspace-alarm_common_dump_by_minutes-jb-354-Data.db')]
INFO [CompactionExecutor:4468] 2015-03-30 09:18:39,189 ColumnFamilyStore.java (line 784) Enqueuing flush of Memtable-compactions_in_progress@810255650(0/0 serialized/live bytes, 1 ops)
INFO [FlushWriter:314] 2015-03-30 09:18:39,189 Memtable.java (line 333) Writing Memtable-compactions_in_progress@810255650(0/0 serialized/live bytes, 1 ops)
INFO [FlushWriter:314] 2015-03-30 09:18:39,357 Memtable.java (line 373) Completed flushing /var/lib/cassandra/data/system/compactions_in_progress/system-compactions_in_progress-jb-1020-Data.db (42 bytes) for commitlog position ReplayPosition(segmentId=1427280544702, position=25306969)
INFO [CompactionExecutor:4468] 2015-03-30 09:18:39,367 CompactionTask.java (line 275) Compacted 4 sstables to [/var/lib/cassandra/data/production_alarm_keyspace/alarm_common_dump_by_minutes/production_alarm_keyspace-alarm_common_dump_by_minutes-jb-358,]. 70,333,241 bytes to 70,337,669 (~100% of original) in 15,180ms = 4.418922MB/s. 260 total partitions merged to 248. Partition merge counts were {1:236, 2:12, }

nodetool status shows only the two original nodes. We accept this, because 2.0.5 has a bug in nodetool that hides joining nodes.

[bpmesb@bpmesbap2 ~]$ nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load      Tokens  Owns   Host ID                               Rack
UN  172.18.8.56  99 GB     256     51.1%  7321548a-3998-4122-965f-0366dd0cc47e  rack1
UN  172.18.8.57  93.78 GB  256     48.9%  bb306032-ff1c-4209-8300-d8c3de843f26  rack1
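
The nodetool status output above can also be checked from a script. This is a minimal Python sketch. It assumes the two-letter code layout described in the output's header: U or D for up/down, followed by N, L, J or M for the state. A joining node would normally appear with the code UJ. According to the question, 2.0.5 omits that row, so an empty joining list is not proof that the new node has finished. The names parse_status and joining_nodes are hypothetical helpers, not part of nodetool.

```python
import re

# State letters as listed in the nodetool status header line.
STATE_NAMES = {"N": "Normal", "L": "Leaving", "J": "Joining", "M": "Moving"}
# A node row starts with the status letter (U/D), the state letter, then the address.
NODE_LINE = re.compile(r"^([UD])([NLJM])\s+(\S+)")


def parse_status(output: str) -> dict:
    """Map each node address in nodetool status output to (is_up, state_name)."""
    nodes = {}
    for line in output.splitlines():
        match = NODE_LINE.match(line.strip())
        if match:
            is_up = match.group(1) == "U"
            nodes[match.group(3)] = (is_up, STATE_NAMES[match.group(2)])
    return nodes


def joining_nodes(output: str) -> list:
    """Return the addresses whose state code is J (Joining)."""
    return [addr for addr, (_, state) in parse_status(output).items() if state == "Joining"]
```

Header lines such as "Datacenter: datacenter1" or "Status=Up/Down" do not match the pattern, so only node rows are returned.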

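Can anyone help with this? DataStax says bootstrapping should take only a few minutes, but ours has not finished after two weeks. We searched Stack Overflow and found this issue, which may be related to our problem.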

1 answer:

Answer 0 (score: 0)

After several days of testing and going through the exception logs, we found what is probably the key to this problem:

ERROR 19:01:01,821 Exception in thread Thread[NonPeriodicTasks:1,5,main]
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.RuntimeException: org.apache.cassandra.db.filter.TombstoneOverwhelmingException
at org.apache.cassandra.utils.FBUtilities.waitOnFuture(FBUtilities.java:413)
at org.apache.cassandra.db.index.SecondaryIndexManager.maybeBuildSecondaryIndexes(SecondaryIndexManager.java:140)
at org.apache.cassandra.streaming.StreamReceiveTask$OnCompletionRunnable.run(StreamReceiveTask.java:113)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)

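The log shows that a stream receive task (StreamReceiveTask) hit a TombstoneOverwhelmingException. We believe this is the key factor, because it kept Cassandra from ever switching the new node's state to NORMAL.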
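
Some background on the exception. In Cassandra 2.0, a single read that has to step over more than tombstone_failure_threshold tombstones is aborted with TombstoneOverwhelmingException. That threshold is set in cassandra.yaml and defaults to 100000. A warning is logged above tombstone_warn_threshold, which defaults to 1000. In the trace above, the read happens inside SecondaryIndexManager.maybeBuildSecondaryIndexes, called from StreamReceiveTask$OnCompletionRunnable. In other words, the index build that runs after streaming is what fails.

The Python model below illustrates that check. It is a simplified, hypothetical sketch: scan and TombstoneOverwhelming are illustrative names, and the code does not mirror Cassandra's actual read path.

```python
import logging


class TombstoneOverwhelming(Exception):
    """Stand-in for Cassandra's TombstoneOverwhelmingException (illustrative only)."""


def scan(cells, failure_threshold=100_000, warn_threshold=1_000):
    """Collect live cells from one slice read, aborting on too many tombstones.

    cells is an iterable of (name, value) pairs; a value of None marks a tombstone.
    Defaults mirror the cassandra.yaml defaults named above. This is a simplified model.
    """
    live, tombstones = [], 0
    for name, value in cells:
        if value is None:
            tombstones += 1
            # Abort the whole read as soon as the failure threshold is exceeded.
            if tombstones > failure_threshold:
                raise TombstoneOverwhelming(
                    f"scanned more than {failure_threshold} tombstones")
        else:
            live.append((name, value))
    if tombstones > warn_threshold:
        logging.warning("read %d live cells and %d tombstones", len(live), tombstones)
    return live, tombstones
```

The point of the model: one partition crowded with deletions is enough to fail the read, however few live cells it contains.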

These are the steps we took to resolve it. On both original nodes, we used nodetool to compact the table and rebuild its secondary index:

nohup nodetool compact production_alarm_keyspace object_daily_data &
nohup nodetool rebuild_index production_alarm_keyspace object_daily_data object_daily_data_alarm_no_idx_1 &
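
One plausible reason this helped has to do with how compaction treats tombstones. Compaction can drop tombstones whose deletion time is older than the table's gc_grace_seconds, which defaults to 864000 seconds (10 days). A major compaction (nodetool compact) merges every SSTable of the table. As a result, fewer tombstones may remain to be streamed to the new node and then scanned while its secondary index is built.

The Python sketch below is a hypothetical, simplified model of that purge rule. It uses one timestamp per cell, standing in for both write time and deletion time in seconds, and has no partitions or range deletions. It is not Cassandra's implementation.

```python
GC_GRACE_SECONDS = 864_000  # default gc_grace_seconds (10 days)


def major_compact(sstables, now, gc_grace=GC_GRACE_SECONDS):
    """Merge all SSTables and drop tombstones older than gc_grace (simplified model).

    Each SSTable is a dict mapping key -> (timestamp, value); value None is a tombstone.
    """
    merged = {}
    for table in sstables:
        for key, (ts, value) in table.items():
            current = merged.get(key)
            # Newest timestamp wins; on a tie the tombstone wins.
            if current is None or ts > current[0] or (ts == current[0] and value is None):
                merged[key] = (ts, value)
    # Because every SSTable took part in the merge, a tombstone past gc_grace
    # no longer has an older copy of the data left to shadow, so it can be dropped.
    return {
        key: (ts, value)
        for key, (ts, value) in merged.items()
        if not (value is None and ts + gc_grace < now)
    }
```

In this model, a tombstone written within the last gc_grace seconds survives the compaction. Only older ones are removed.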

We then restarted the new node. About an hour later it switched to the NORMAL state, and it has been working fine ever since.
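
On 2.0.5, nodetool status did not list the joining node. Another way to watch a restarted node is nodetool netstats, whose output begins with the node's operation mode, for example "Mode: JOINING" or "Mode: NORMAL".

Below is a minimal Python polling sketch. It assumes nodetool is on the PATH and that the target node's JMX port is reachable through nodetool's -h option. The names parse_mode and wait_until_normal are hypothetical helpers, and the interval and timeout values are arbitrary.

```python
import re
import subprocess
import time

# The first line of nodetool netstats reports the node's operation mode.
MODE_LINE = re.compile(r"^Mode:\s*(\w+)", re.MULTILINE)


def parse_mode(netstats_output: str):
    """Return the mode (e.g. 'JOINING', 'NORMAL') from netstats output, or None."""
    match = MODE_LINE.search(netstats_output)
    return match.group(1) if match else None


def wait_until_normal(host: str, interval: float = 60.0, timeout: float = 6 * 3600) -> bool:
    """Poll nodetool -h HOST netstats until the node reports NORMAL or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = subprocess.run(
            ["nodetool", "-h", host, "netstats"],
            capture_output=True, text=True, check=True,
        )
        mode = parse_mode(result.stdout)
        if mode == "NORMAL":
            return True
        print(f"{host}: mode={mode}")
        time.sleep(interval)
    return False
```

Usage would look like wait_until_normal("<new node address>"). The address is a placeholder, since the post does not give the new node's IP.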