Question

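We are running Cassandra 3.0.3 on AWS on r3.xlarge machines (64G RAM, 16 cores each): 6 machines across 2 data centers, but this particular keyspace is replicated on only 3 nodes in one DC. We write about 300M rows into Cassandra as a weekly sync.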

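While the data is loading, the load average on the machines climbs as high as 34 with 100% CPU utilization (a lot of data gets rewritten in this case). We expected the load to be slow, but the performance degradation on one of the nodes is striking.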

Load average output from the machines, taken as a snapshot:

On Overloaded Machine:
27.47, 29.78, 30.06

On other two:
2.65, 3.95, 4.59
3.76, 2.52, 2.50

nodetool status output:

Overloaded Node:
UN  10.21.56.21    65.94 GB   256          38.7%             57f35206-f264-44ec-b588-f72883139f69  rack1

Other two Nodes:
UN  10.21.56.20    56.34 GB   256          31.9%             2b29f85c-c783-4e20-8cea-95d4e2688550  rack1
UN  10.21.56.23    51.29 GB   256          29.4%             fbf26f1d-1766-4f12-957c-7278fd19c20c  rack1

I can see that the sstable count is also high, and sstables are flushed at a size of about 15MB. The heap size is 8GB and we use G1GC.

The output of nodetool cfhistograms shows a stark difference between write and read latency, as shown below for one of the larger tables:

| Percentile    |  SSTables     |  Write Latency (micros) |  Read Latency (micros) |  Partition Size (bytes) |  Cell Count   |
|-------------  |------------   |------------------------ |----------------------- |------------------------ |-------------- |
| 50%           | 8             | 20.5              | 1629.72           | 179               | 5             |
| 75%           | 10            | 24.6              | 2346.8            | 258               | 10            |
| 95%           | 12            | 42.51             | 4866.32           | 1109              | 72            |
| 98%           | 14            | 51.01             | 10090.81          | 3973              | 258           |
| 99%           | 14            | 61.21             | 14530.76          | 9887              | 642           |
| Min           | 0             | 4.77              | 11.87             | 104               | 5             |
| Max           | 17            | 322381.14         | 17797419.59       | 557074610         | 36157190      |

The nodetool proxyhistograms output can be found below:

Percentile      Read Latency     Write Latency     Range Latency
                    (micros)          (micros)          (micros)
50%                   263.21            654.95          20924.30
75%                   654.95            785.94          30130.99
95%                  1629.72          36157.19          52066.35
98%                  4866.32         155469.30          62479.63
99%                  7007.51         322381.14          74975.55
Min                     6.87             11.87             24.60
Max              12359319.16       30753941.06       63771372.18

One odd thing I can observe here is that the mutation count differs greatly from machine to machine:

MutationStage Pool Completed Total:
Overloaded Node: 307531460526
Other Node1: 77979732754
Other Node2: 146376997379

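Here the total on the overloaded node is ~4x that of Other Node1 and ~2x that of Other Node2. Is this expected in a well-distributed keyspace with the Murmur3 partitioner?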
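The ratios quoted above can be checked with a few lines of Python. This is only a quick sanity check: the counts are copied from the MutationStage totals listed above, and the variable names are illustrative.

```python
# MutationStage "Completed" totals copied from the figures above.
completed = {
    "Overloaded Node": 307531460526,
    "Other Node1": 77979732754,
    "Other Node2": 146376997379,
}

overloaded = completed["Overloaded Node"]

# How many times more mutations the overloaded node completed than each peer.
ratios = {
    node: overloaded / count
    for node, count in completed.items()
    if node != "Overloaded Node"
}

for node, ratio in ratios.items():
    print(f"Overloaded Node completed {ratio:.2f}x the mutations of {node}")
```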

The nodetool cfstats output is attached below for reference:

Keyspace: cat-48
    Read Count: 122253245
    Read Latency: 1.9288832487759324 ms.
    Write Count: 122243273
    Write Latency: 0.02254735837284069 ms.
    Pending Flushes: 0
        Table: bucket_distribution
        SSTable count: 11
        Space used (live): 10149121447
        Space used (total): 10149121447
        Space used by snapshots (total): 0
        Off heap memory used (total): 14971512
        SSTable Compression Ratio: 0.637019014259346
        Number of keys (estimate): 2762585
        Memtable cell count: 255915
        Memtable data size: 19622027
        Memtable off heap memory used: 0
        Memtable switch count: 487
        Local read count: 122253245
        Local read latency: 2.116 ms
        Local write count: 122243273
        Local write latency: 0.025 ms
        Pending flushes: 0
        Bloom filter false positives: 17
        Bloom filter false ratio: 0.00000
        Bloom filter space used: 9588144
        Bloom filter off heap memory used: 9588056
        Index summary off heap memory used: 3545264
        Compression metadata off heap memory used: 1838192
        Compacted partition minimum bytes: 104
        Compacted partition maximum bytes: 557074610
        Compacted partition mean bytes: 2145
        Average live cells per slice (last five minutes): 8.83894307680672
        Maximum live cells per slice (last five minutes): 5722
        Average tombstones per slice (last five minutes): 1.0
        Maximum tombstones per slice (last five minutes): 1
----------------
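To put the partition-size figures in the output above into perspective, here is a minimal sketch that extracts the compacted partition statistics from the cfstats text and compares the largest partition with the mean. The sample string is copied from the output above; `parse_partition_stats` is a hypothetical helper written for this example, not something nodetool provides.

```python
import re

# Lines copied from the cfstats output above.
CFSTATS_SAMPLE = """\
        Compacted partition minimum bytes: 104
        Compacted partition maximum bytes: 557074610
        Compacted partition mean bytes: 2145
"""

def parse_partition_stats(text):
    """Return the minimum, maximum and mean compacted partition size in bytes."""
    pattern = re.compile(
        r"Compacted partition (minimum|maximum|mean) bytes:\s*(\d+)"
    )
    return {name: int(value) for name, value in pattern.findall(text)}

stats = parse_partition_stats(CFSTATS_SAMPLE)
max_mib = stats["maximum"] / (1024 * 1024)
skew = stats["maximum"] / stats["mean"]

print(f"largest partition: {max_mib:.1f} MiB")
print(f"largest partition is {skew:,.0f}x the mean partition size")
```

A maximum that is several orders of magnitude above the mean suggests that a small number of partitions are very wide.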

In addition, in nodetool tpstats I can see that at peak load one node (the one getting overloaded) has pending Native-Transport-Requests:

Overloaded Node:
Native-Transport-Requests        32        11      651595401         0               349
MutationStage                    32        41   316508231055         0                 0

The other two:
Native-Transport-Requests         0         0      625706001         0               495
MutationStage                     0         0   151442471377         0                 0
Native-Transport-Requests         0         0      630331805         0               219
MutationStage                     0         0    78369542703         0                 0

I also checked nodetool compactionstats. The output was 0 most of the time, and when a compaction did run, the load did not increase alarmingly.

Answer 1

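We traced the problem to the data model and to a kernel bug that had not been patched in the kernel we were using. Some of the partitions in the data we were writing were large, which made the write requests unbalanced; because the RF is 1, one server appeared to be heavily loaded.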
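To illustrate how a single oversized partition can skew write load even when keys hash evenly, here is a minimal simulation. It is a sketch under simplifying assumptions, not Cassandra's actual token ring: SHA-256 from `hashlib` stands in for the Murmur3 partitioner, each key maps to one of three nodes by modulo (RF = 1, no vnodes), and the key names and write counts are made up.

```python
import hashlib
from collections import Counter

NODES = 3  # assumed: three nodes, RF = 1, so each partition lives on one node

def owner(partition_key):
    """Map a key to a node. SHA-256 stands in for Murmur3, modulo for the ring."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NODES

def writes_per_node(writes_by_key):
    """Sum the number of writes that land on each node."""
    load = Counter({node: 0 for node in range(NODES)})
    for key, writes in writes_by_key.items():
        load[owner(key)] += writes
    return load

# Hypothetical workload: 30,000 small partitions with 10 writes each.
even = {f"key-{i}": 10 for i in range(30_000)}
# The same workload plus one very wide partition receiving 500,000 writes.
skewed = dict(even)
skewed["hot-key"] = 500_000

even_load = writes_per_node(even)
skewed_load = writes_per_node(skewed)

print("even workload:  ", dict(even_load))
print("skewed workload:", dict(skewed_load))
print("node owning hot-key:", owner("hot-key"))
```

In the even workload the three nodes receive roughly equal numbers of writes. In the skewed workload every write to the wide partition lands on the single node that owns it, so that node carries several times the load of the others even though the hash distributes keys evenly.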

The kernel issue is described in detail here (in short, it affects Java applications that use park to wait): datastax blog

This is fixed by Linux Commit

Cassandra writes get slower and slower on rewrites - load average spikes on one machine in the cluster