I have a four-node Cassandra 2.0.16 cluster that is behaving inconsistently. I ran nodetool status on every node to check its view of the cluster, and the results disagree with each other; the command is also extremely slow. The output from each node is shown below.
The nodes 192.168.148.181 and 192.168.148.121 are the seed nodes. Repair has never been run on this cluster.
In addition, CPU usage on 181 and 121 is very high, and the logs show very frequent CMS GC on those nodes. I disconnected all clients, so there is no read or write load, yet the inconsistency and the heavy GC persist.
How can I debug and tune this cluster?
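Before the per-node output below, this is roughly how I confirmed the frequent CMS GC on 181 and 121 (a sketch: the log path is an assumption for my install, and <cassandra-pid> stands for the Cassandra process id):

# Cassandra logs long GC pauses through GCInspector
grep GCInspector logs/system.log | tail -n 5
# sample the JVM's GC counters every 5 seconds with the JDK's jstat
jstat -gcutil <cassandra-pid> 5000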
[cassandra@whaty121 apache-cassandra-2.0.16]$ time bin/nodetool status
Note: Ownership information does not include topology; for complete information, specify a keyspace
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 192.168.148.121 10.86 GB 1 25.0% 1d9ba597-c404-481f-af2b-436493c57227 RAC2
UN 192.168.148.181 10.53 GB 1 25.0% 5d90300f-2fb4-4065-9819-10ece285223d RAC1
DN 192.168.148.182 10.95 GB 1 25.0% bcb550df-9429-4cae-9fd2-0bfeea9a5649 RAC4
UN 192.168.148.221 10.49 GB 1 25.0% 6867f8b4-1f54-48fc-aaae-da71bc251970 RAC3
real 8m50.506s
user 39m48.718s
sys 76m48.566s
--------------------------------------------------------------------------------
[cassandra@whaty221 apache-cassandra-2.0.16]$ time bin/nodetool status
Note: Ownership information does not include topology; for complete information, specify a keyspace
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
DN 192.168.148.121 10.86 GB 1 25.0% 1d9ba597-c404-481f-af2b-436493c57227 RAC2
UN 192.168.148.181 10.53 GB 1 25.0% 5d90300f-2fb4-4065-9819-10ece285223d RAC1
DN 192.168.148.182 10.95 GB 1 25.0% bcb550df-9429-4cae-9fd2-0bfeea9a5649 RAC4
UN 192.168.148.221 10.49 GB 1 25.0% 6867f8b4-1f54-48fc-aaae-da71bc251970 RAC3
real 0m15.075s
user 0m1.606s
sys 0m0.393s
----------------------------------------------------------------------
[cassandra@whaty181 apache-cassandra-2.0.16]$ time bin/nodetool status
Note: Ownership information does not include topology; for complete information, specify a keyspace
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
DN 192.168.148.121 10.86 GB 1 25.0% 1d9ba597-c404-481f-af2b-436493c57227 RAC2
UN 192.168.148.181 10.53 GB 1 25.0% 5d90300f-2fb4-4065-9819-10ece285223d RAC1
UN 192.168.148.182 10.95 GB 1 25.0% bcb550df-9429-4cae-9fd2-0bfeea9a5649 RAC4
UN 192.168.148.221 10.49 GB 1 25.0% 6867f8b4-1f54-48fc-aaae-da71bc251970 RAC3
real 0m25.719s
user 0m2.152s
sys 0m1.228s
-------------------------------------------------------------------------
[cassandra@whaty182 apache-cassandra-2.0.16]$ time bin/nodetool status
Note: Ownership information does not include topology; for complete information, specify a keyspace
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
DN 192.168.148.121 10.86 GB 1 25.0% 1d9ba597-c404-481f-af2b-436493c57227 RAC2
DN 192.168.148.181 10.53 GB 1 25.0% 5d90300f-2fb4-4065-9819-10ece285223d RAC1
UN 192.168.148.182 10.95 GB 1 25.0% bcb550df-9429-4cae-9fd2-0bfeea9a5649 RAC4
DN 192.168.148.221 10.49 GB 1 25.0% 6867f8b4-1f54-48fc-aaae-da71bc251970 RAC3
real 0m17.581s
user 0m1.843s
sys 0m1.632s
I printed a histogram of the objects on the Java heap:
num #instances #bytes class name
----------------------------------------------
1: 58584535 1874705120 java.util.concurrent.FutureTask
2: 58585802 1406059248 java.util.concurrent.Executors$RunnableAdapter
3: 58584601 1406030424 java.util.concurrent.LinkedBlockingQueue$Node
4: 58584534 1406028816 org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask
5: 214682 24087416 [B
6: 217294 10430112 java.nio.HeapByteBuffer
7: 37591 5977528 [C
8: 41843 5676048 <constMethodKlass>
9: 41843 5366192 <methodKlass>
10: 4126 4606080 <constantPoolKlass>
11: 100060 4002400 org.apache.cassandra.io.sstable.IndexHelper$IndexInfo
12: 4126 2832176 <instanceKlassKlass>
13: 4880 2686216 [J
14: 3619 2678784 <constantPoolCacheKlass>
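The histogram above can typically be produced with the JDK's jmap tool; the invocation below is a sketch (<cassandra-pid> is the Cassandra process id):

# find the Cassandra process id
pgrep -f CassandraDaemon
# class histogram of live heap objects (note: -histo:live forces a full GC first)
jmap -histo:live <cassandra-pid> | head -n 20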
I ran nodetool compactionstats on one node and found that a huge number of compaction tasks has accumulated over the last 3 days (I restarted the cluster 3 days ago):
[cassandra@whaty181 apache-cassandra-2.0.16]$ bin/nodetool compactionstats
pending tasks: 64642341
Active compaction remaining time : n/a
I also checked the compaction history. Part of the result is shown below; it contains many records for the system keyspace.
Compaction History:
id keyspace_name columnfamily_name compacted_at bytes_in bytes_out rows_merged
8e4f8830-b04f-11e5-a211-45b7aa88107c system sstable_activity 1451629144115 3342 915 {4:23}
96a6fcb0-b04b-11e5-a211-45b7aa88107c system hints 1451627440123 18970740 18970740 {1:1}
7c42c940-adac-11e5-8bd4-45b7aa88107c system hints 1451339203540 56969835 56782732 {2:3}
585b97a0-ad98-11e5-8bd4-45b7aa88107c system sstable_activity 1451330553370 3700 956 {4:24}
aefc3f10-b1b2-11e5-a211-45b7aa88107c system sstable_activity 1451781670273 3201 906 {4:23}
1e76f1b0-b180-11e5-a211-45b7aa88107c system sstable_activity 1451759952971 3303 700 {4:23}
e7b75b70-aec2-11e5-8bd4-45b7aa88107c system hints 1451458783911 57690316 57497847 {2:3}
ad102280-af6d-11e5-b1dc-45b7aa88107c webtrn_study_log_formallySCORM_STU_COURSE 1451532129448 45671877 41137664 {1:11, 3:1, 4:8}
60906970-aec7-11e5-8bd4-45b7aa88107c system sstable_activity 1451460704647 3751 974 {4:25}
88aed310-ad91-11e5-8bd4-45b7aa88107c system hints 1451327627969 56984347 56765328 {2:3}
3ad14f00-af6d-11e5-b1dc-45b7aa88107c webtrn_study_log_formallySCORM_STU_COURSE 1451531937776 46696097 38827028 {1:8, 3:2, 4:9}
84df8fb0-b00f-11e5-a211-45b7aa88107c system hints 1451601640491 18970740 18970740 {1:1}
657482e0-ad33-11e5-8bd4-45b7aa88107c system sstable_activity 1451287196174 3701 931 {4:24}
9cc8af70-b24a-11e5-a211-45b7aa88107c system sstable_activity 1451846923239 3134 773 {4:23}
dcbe5e30-afd0-11e5-a211-45b7aa88107c system sstable_activity 1451574729619 3357 790 {4:23}
b285ced0-afa0-11e5-84e3-45b7aa88107c system hints 1451554042941 43310718 42137761 {1:1, 2:2}
119770e0-ad4e-11e5-8bd4-45b7aa88107c system hints 1451298651886 57397441 57190519 {2:3}
f1bb37a0-b204-11e5-a211-45b7aa88107c system hints 1451817000986 17713746
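For reference, this history can be dumped with nodetool or queried directly from the system table in cqlsh (a sketch; the host address is just one of the nodes above):

# nodetool prints the contents of system.compaction_history
bin/nodetool compactionhistory
# the same data can be queried from cqlsh
bin/cqlsh 192.168.148.181
# cqlsh> SELECT * FROM system.compaction_history LIMIT 20;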
I tried to flush the nodes with heavy GC, but the flush failed with a read timeout.
The cluster only receives inserts. I stopped all client writes and restarted the cluster, yet over these 3 days the compaction tasks have kept accumulating.
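For completeness, the flush attempt was roughly this (a sketch; the target host is illustrative, and a keyspace and table can be appended to narrow the flush):

# flush all memtables on the node with heavy GC (this is the call that timed out)
bin/nodetool -h 192.168.148.181 flush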
Answer (score: 3)
The inconsistent nodetool status output is nothing to worry about in itself; it is a result of all that GC. During a GC pause, the gossipers on the other nodes consider the node to be down, and with this much GC the nodes flap quickly between DN and UN.
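One quick way to see this flapping is to compare what each node's gossiper currently believes about its peers (a sketch, run on every node):

# print this node's gossip view of every peer: generation, heartbeat and STATUS
bin/nodetool gossipinfo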
You have to understand what is taking up so much space in the Java heap.
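A class histogram like the one already posted, or a full heap dump analysed offline, is the usual way to find out; a sketch (the dump path and <cassandra-pid> are illustrative):

# write a binary snapshot of the live heap for offline analysis (e.g. with Eclipse MAT)
jmap -dump:live,format=b,file=/tmp/cassandra-heap.hprof <cassandra-pid>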
In nodetool cfstats, do you see anything under system.hints? Hints are mutations that the coordinator holds back so they can be delivered when the load is lower. If your cluster has piled up a lot of hints, they put pressure on the heap and cause GC. Also check nodetool compactionstats and try nodetool flush.
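If system.hints does turn out to be large, here is a sketch of how you might inspect and clear it (the data directory path is an assumption, and nodetool truncatehints may not exist on every 2.0.x release, so verify before relying on it):

# on-disk size of the stored hints (adjust the data directory to your install)
du -sh /var/lib/cassandra/data/system/hints
# per-table statistics; pass system.hints if your nodetool accepts an argument,
# otherwise run it bare and find the hints section under the system keyspace
bin/nodetool cfstats system.hints
# drop all stored hints on this node; alternatively run TRUNCATE system.hints from cqlsh
bin/nodetool truncatehints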