Question

我有一个带有四个节点的cassandra 2.0.6群集。 cassandra遇到了不一致的问题。我使用nodetool status来检查每个节点上的状态。结果不一致。此状态命令运行速度非常慢。以下是每个节点的命令结果。

ip 192.168.148.181和192.168.148.121的节点是种子节点。群集从未运行过修复。

此外，181和121上的cpu使用率非常高，日志显示CMS GC在这些节点上非常频繁。我断开所有客户端的连接，没有读写加载。这种一致性和高GC持续存在。

那么如何调试和优化这个集群呢？

[cassandra@whaty121 apache-cassandra-2.0.16]$ time bin/nodetool status
Note: Ownership information does not include topology; for complete information, specify a keyspace
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load       Tokens  Owns   Host ID                               Rack
UN  192.168.148.121  10.86 GB   1       25.0%  1d9ba597-c404-481f-af2b-436493c57227  RAC2
UN  192.168.148.181  10.53 GB   1       25.0%  5d90300f-2fb4-4065-9819-10ece285223d  RAC1
DN  192.168.148.182  10.95 GB   1       25.0%  bcb550df-9429-4cae-9fd2-0bfeea9a5649  RAC4
UN  192.168.148.221  10.49 GB   1       25.0%  6867f8b4-1f54-48fc-aaae-da71bc251970  RAC3

real    8m50.506s
user    39m48.718s
sys     76m48.566s
--------------------------------------------------------------------------------
[cassandra@whaty221 apache-cassandra-2.0.16]$  time bin/nodetool status
Note: Ownership information does not include topology; for complete information, specify a keyspace
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load       Tokens  Owns   Host ID                               Rack
DN  192.168.148.121  10.86 GB   1       25.0%  1d9ba597-c404-481f-af2b-436493c57227  RAC2
UN  192.168.148.181  10.53 GB   1       25.0%  5d90300f-2fb4-4065-9819-10ece285223d  RAC1
DN  192.168.148.182  10.95 GB   1       25.0%  bcb550df-9429-4cae-9fd2-0bfeea9a5649  RAC4
UN  192.168.148.221  10.49 GB   1       25.0%  6867f8b4-1f54-48fc-aaae-da71bc251970  RAC3

real    0m15.075s
user    0m1.606s
sys     0m0.393s
----------------------------------------------------------------------
[cassandra@whaty181 apache-cassandra-2.0.16]$  time bin/nodetool status
Note: Ownership information does not include topology; for complete information, specify a keyspace
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load       Tokens  Owns   Host ID                               Rack
DN  192.168.148.121  10.86 GB   1       25.0%  1d9ba597-c404-481f-af2b-436493c57227  RAC2
UN  192.168.148.181  10.53 GB   1       25.0%  5d90300f-2fb4-4065-9819-10ece285223d  RAC1
UN  192.168.148.182  10.95 GB   1       25.0%  bcb550df-9429-4cae-9fd2-0bfeea9a5649  RAC4
UN  192.168.148.221  10.49 GB   1       25.0%  6867f8b4-1f54-48fc-aaae-da71bc251970  RAC3

real    0m25.719s
user    0m2.152s
sys     0m1.228s
-------------------------------------------------------------------------

[cassandra@whaty182 apache-cassandra-2.0.16]$  time bin/nodetool status
Note: Ownership information does not include topology; for complete information, specify a keyspace
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load       Tokens  Owns   Host ID                               Rack
DN  192.168.148.121  10.86 GB   1       25.0%  1d9ba597-c404-481f-af2b-436493c57227  RAC2
DN  192.168.148.181  10.53 GB   1       25.0%  5d90300f-2fb4-4065-9819-10ece285223d  RAC1
UN  192.168.148.182  10.95 GB   1       25.0%  bcb550df-9429-4cae-9fd2-0bfeea9a5649  RAC4
DN  192.168.148.221  10.49 GB   1       25.0%  6867f8b4-1f54-48fc-aaae-da71bc251970  RAC3

real    0m17.581s
user    0m1.843s
sys     0m1.632s

我打印gc的对象详细信息：

 num     #instances         #bytes  class name
----------------------------------------------
   1:      58584535     1874705120  java.util.concurrent.FutureTask
   2:      58585802     1406059248  java.util.concurrent.Executors$RunnableAdapter
   3:      58584601     1406030424  java.util.concurrent.LinkedBlockingQueue$Node
   4:      58584534     1406028816  org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask
   5:        214682       24087416  [B
   6:        217294       10430112  java.nio.HeapByteBuffer
   7:         37591        5977528  [C
   8:         41843        5676048  <constMethodKlass>
   9:         41843        5366192  <methodKlass>
  10:          4126        4606080  <constantPoolKlass>
  11:        100060        4002400  org.apache.cassandra.io.sstable.IndexHelper$IndexInfo
  12:          4126        2832176  <instanceKlassKlass>
  13:          4880        2686216  [J
  14:          3619        2678784  <constantPoolCacheKlass>

我在一个节点上使用nodetool cfstats并发现许多压缩任务已在3天内累积（我在3天前重新启动了群集）

[cassandra@whaty181 apache-cassandra-2.0.16]$ bin/nodetool compactionstats
pending tasks: 64642341
Active compaction remaining time :        n/a

我检查了compactionhistory。这是结果的一部分。它显示了许多与密钥空间系统相关的记录。

Compaction History: 
id                                       keyspace_name      columnfamily_name            compacted_at              bytes_in       bytes_out      rows_merged
8e4f8830-b04f-11e5-a211-45b7aa88107c     system             sstable_activity             1451629144115             3342           915            {4:23}
96a6fcb0-b04b-11e5-a211-45b7aa88107c     system             hints                        1451627440123             18970740       18970740       {1:1}
7c42c940-adac-11e5-8bd4-45b7aa88107c     system             hints                        1451339203540             56969835       56782732       {2:3}
585b97a0-ad98-11e5-8bd4-45b7aa88107c     system             sstable_activity             1451330553370             3700           956            {4:24}
aefc3f10-b1b2-11e5-a211-45b7aa88107c     system             sstable_activity             1451781670273             3201           906            {4:23}
1e76f1b0-b180-11e5-a211-45b7aa88107c     system             sstable_activity             1451759952971             3303           700            {4:23}
e7b75b70-aec2-11e5-8bd4-45b7aa88107c     system             hints                        1451458783911             57690316       57497847       {2:3}
ad102280-af6d-11e5-b1dc-45b7aa88107c     webtrn_study_log_formallySCORM_STU_COURSE             1451532129448             45671877       41137664       {1:11, 3:1, 4:8}
60906970-aec7-11e5-8bd4-45b7aa88107c     system             sstable_activity             1451460704647             3751           974            {4:25}
88aed310-ad91-11e5-8bd4-45b7aa88107c     system             hints                        1451327627969             56984347       56765328       {2:3}
3ad14f00-af6d-11e5-b1dc-45b7aa88107c     webtrn_study_log_formallySCORM_STU_COURSE             1451531937776             46696097       38827028       {1:8, 3:2, 4:9}
84df8fb0-b00f-11e5-a211-45b7aa88107c     system             hints                        1451601640491             18970740       18970740       {1:1}
657482e0-ad33-11e5-8bd4-45b7aa88107c     system             sstable_activity             1451287196174             3701           931            {4:24}
9cc8af70-b24a-11e5-a211-45b7aa88107c     system             sstable_activity             1451846923239             3134           773            {4:23}
dcbe5e30-afd0-11e5-a211-45b7aa88107c     system             sstable_activity             1451574729619             3357           790            {4:23}
b285ced0-afa0-11e5-84e3-45b7aa88107c     system             hints                        1451554042941             43310718       42137761       {1:1, 2:2}
119770e0-ad4e-11e5-8bd4-45b7aa88107c     system             hints                        1451298651886             57397441       57190519       {2:3}
f1bb37a0-b204-11e5-a211-45b7aa88107c     system             hints                        1451817000986             17713746

我尝试使用高gc刷新节点，但是在读取超时时它返回失败。

群集只接收要插入的数据。我关闭了客户端写入并在这3天内重新启动了群集。压实任务仍在积累。

Answer 1

nodetool状态输出不一致无需担心。这是拥有大量GC的结果。在GC期间，其他节点gossipers认为节点已关闭。然后当你有很多GC时，节点很快就会从DN切换到UN。

你必须了解在java堆中占用如此多空间的内容

你在cassandra日志中有任何StatusLogger吗？
使用nodetool cfstats，您看到任何system.hints吗？提示是协调器延迟的突变，以便在负载较低时提供。如果你的集群已经积累了很多提示，那么它会给堆压力并导致GC。
有任何压缩吗？ nodetool compaction stats
刷新所有列族是否会冷却群集？每个ks和cf的所有节点上都有nodetool flush

Cassandra nodetool状态在不同节点上不一致，有太多未决的压缩任务

1 个答案: