Why does a node in my Cassandra cluster show as unreachable?

Time: 2017-07-06 16:39:59

Tags: cassandra

I have a 3-node Cassandra v3.9 cluster on CentOS. The replication factor is 2, the replication strategy is NetworkTopologyStrategy, and there is a single data center.

One of the nodes is not a "seed" node, and quite often, when I run "nodetool describecluster", it shows up as unreachable, in some cases for up to 10 minutes. After that, it shows up as a normal node again. When I look at /var/log/cassandra/debug.log, I see the following line on one of the seed nodes:

"DEBUG [RMI TCP Connection(118)-127.0.0.1] 2017-07-06 09:15:40,519 StorageProxy.java:2254 - Hosts not in agreement. Didn't get a response from everybody: 10.0.0.113"

Here is my cassandra.yaml configuration. The other two nodes have a similar configuration; only the IP addresses differ.

# Cassandra storage config YAML

# NOTE:
#   See http://wiki.apache.org/cassandra/StorageConfiguration for
#   full explanations of configuration directives
# /NOTE

# The name of the cluster. This is mainly used to prevent machines in
# one logical cluster from joining another.
cluster_name: 'MyCluster'

num_tokens: 256

<<some other default settings>>

# Directory where Cassandra should store hints.
# If not set, the default directory is $CASSANDRA_HOME/data/hints.
# hints_directory: /var/lib/cassandra/hints
hints_directory: /var/lib/cassandra/data/hints

<<some other default settings>>

# any class that implements the SeedProvider interface and has a
# constructor that takes a Map<String, String> of parameters will do.
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          # seeds is actually a comma-delimited list of addresses.
          # Ex: "<ip1>,<ip2>,<ip3>"
          - seeds: "10.0.0.111,10.0.0.112"

# For workloads with more data than can fit in memory, Cassandra's
# bottleneck will be reads that need to fetch data from
# disk. "concurrent_reads" should be set to (16 * number_of_drives) in
# order to allow the operations to enqueue low enough in the stack
# that the OS and drives can reorder them. Same applies to
# "concurrent_counter_writes", since counter writes read the current
# values before incrementing and writing them back.
#
# On the other hand, since writes are almost never IO bound, the ideal
# number of "concurrent_writes" is dependent on the number of cores in
# your system; (8 * number_of_cores) is a good rule of thumb.
concurrent_reads: 32
concurrent_writes: 32
concurrent_counter_writes: 32

<<some other default settings>>

# TCP port, for commands and data
# For security reasons, you should not expose this port to the internet.  Firewall it if needed.
storage_port: 7000

# SSL port, for encrypted communication.  Unused unless enabled in
# encryption_options
# For security reasons, you should not expose this port to the internet.  Firewall it if needed.
ssl_storage_port: 7001

listen_address: 10.0.0.113
listen_interface_prefer_ipv6: false

#broadcast_address: 134.207.255.11

#listen_on_broadcast_address: true

start_native_transport: true
native_transport_port: 9042

native_transport_max_frame_size_in_mb: 256

native_transport_max_concurrent_connections: -1

# The maximum number of concurrent client connections per source ip.
# The default is -1, which means unlimited.
native_transport_max_concurrent_connections_per_ip: -1

# Whether to start the thrift rpc server.
start_rpc: false


rpc_address: 10.0.0.113

rpc_interface_prefer_ipv6: false

# port for Thrift to listen for clients on
rpc_port: 9160

# RPC address to broadcast to drivers and other Cassandra nodes. This cannot
# be set to 0.0.0.0. If left blank, this will be set to the value of
# rpc_address. If rpc_address is set to 0.0.0.0, broadcast_rpc_address must
# be set.
broadcast_rpc_address: 134.207.255.11

# enable or disable keepalive on rpc/native connections
rpc_keepalive: true

<<some other default settings>>

# How long the coordinator should wait for read operations to complete
read_request_timeout_in_ms: 1800000
# How long the coordinator should wait for seq or index scans to complete
range_request_timeout_in_ms: 1800000
# How long the coordinator should wait for writes to complete
write_request_timeout_in_ms: 200000
# How long the coordinator should wait for counter writes to complete
counter_write_request_timeout_in_ms: 5000
# How long a coordinator should continue to retry a CAS operation
# that contends with other proposals for the same row
cas_contention_timeout_in_ms: 1000
# How long the coordinator should wait for truncates to complete
# (This can be much longer, because unless auto_snapshot is disabled
# we need to flush first so we can snapshot before removing the data.)
truncate_request_timeout_in_ms: 60000
# The default timeout for other, miscellaneous operations
request_timeout_in_ms: 10000

# Enable operation timeout information exchange between nodes to accurately
# measure request timeouts.  If disabled, replicas will assume that requests
# were forwarded to them instantly by the coordinator, which means that
# under overload conditions we will waste that much extra time processing
# already-timed-out requests.
#
# Warning: before enabling this property make sure ntp is installed
# and the times are synchronized between the nodes.
cross_node_timeout: true

endpoint_snitch: GossipingPropertyFileSnitch

# controls how often to perform the more expensive part of host score
# calculation
dynamic_snitch_update_interval_in_ms: 100
# controls how often to reset all host scores, allowing a bad host to
# possibly recover
dynamic_snitch_reset_interval_in_ms: 600000

request_scheduler: org.apache.cassandra.scheduler.NoScheduler

<<some other default settings>>
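Note that I raised several of the coordinator timeouts well above stock (the read and range timeouts are 1,800,000 ms, i.e. 30 minutes). For comparison, these are the out-of-the-box values from an unmodified Cassandra 3.9 cassandra.yaml, as far as I know:

    read_request_timeout_in_ms: 5000
    range_request_timeout_in_ms: 10000
    write_request_timeout_in_ms: 2000
    counter_write_request_timeout_in_ms: 5000
    cas_contention_timeout_in_ms: 1000
    truncate_request_timeout_in_ms: 60000
    request_timeout_in_ms: 10000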

Here is the system.log file from the 10.0.0.113 server:

When the above error occurs, my application behaves strangely and fails. I am using NTP on all three nodes.
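Since cross_node_timeout is enabled, the clocks have to stay in sync between nodes. This is the check I run on each node (assuming a classic ntpd setup; a chrony-based install would use "chronyc sources" instead):

    # List NTP peers; '*' marks the selected sync source, and the
    # offset column (in milliseconds) should stay small on all nodes.
    ntpq -p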

The question is why node 10.0.0.113 shows as UNREACHABLE so often, and how to fix it permanently. I don't want to remove the node from the cluster; I want to fix this error and keep the node available at all times.
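For what it is worth, basic connectivity between the nodes can also be probed on the ports from the configuration above (a rough check, assuming nc/ncat is installed; on CentOS, firewalld rules would be the first suspect if these fail):

    # From each seed (10.0.0.111 / 10.0.0.112), probe the storage
    # (gossip) port on the flapping node:
    nc -zv 10.0.0.113 7000

    # ...and the native transport port used by client drivers:
    nc -zv 10.0.0.113 9042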

Thanks in advance.

0 Answers:

No answers