Question

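I am currently facing an issue where, from time to time, I see RPC latency problems on my standby NameNode. The log event for this instance looks like this: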

The health test result for NAME_NODE_RPC_LATENCY has become bad: The moving average of the RPC latency is 6 second(s) over the previous 5 minute(s). The moving average of the queue time is 0 second(s). The moving average of the processing time is 6 second(s). Critical threshold: 5 second(s). 
Time: Sep 25, 2015 5:52:02 AM

We see these RPC alerts from time to time. I checked the logs from around the time the problem occurred and found nothing unusual, only this:

Call#0 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby

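This usually happens because the client does not know which NameNode it is connecting to; since this node is in standby state, the client then fails over seamlessly to the active NN.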

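I checked the RPC average queue time and processing time. On one occasion I saw a burst of connections at the time we got the alert, but on another occasion when the health test went bad there was no burst of requests.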

Any suggestions? Is there anything else I can check?

Answer 1

The following message is due to a bug:

Call#0 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby

See https://issues.apache.org/jira/browse/AMBARI-13373, which describes a similar issue.

If you have HA enabled, DataNodes may be trying to connect to the standby NameNode because dfs.namenode.rpc-address is set to the standby NameNode. Check hdfs-site.xml for this property.

Workaround: Remove this property. With HA enabled, dfs.namenode.rpc-address.DEMOMASTER.nn1 and dfs.namenode.rpc-address.DEMOMASTER.nn2 already serve the purpose of dfs.namenode.rpc-address.
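
Before removing anything, it may help to confirm that the bare property is actually present. The following is a minimal sketch, not an official tool: the function name has_plain_rpc_address and the default path /etc/hadoop/conf/hdfs-site.xml are assumptions, and it only matches a <name> element written on a single line.

```shell
#!/bin/sh
# Return 0 if the given hdfs-site.xml defines the bare (non-suffixed)
# dfs.namenode.rpc-address property, non-zero otherwise.
# Suffixed HA keys such as dfs.namenode.rpc-address.DEMOMASTER.nn1 do not match.
has_plain_rpc_address() {
    grep -q '<name>[[:space:]]*dfs\.namenode\.rpc-address[[:space:]]*</name>' "$1"
}

# Assumed location of the config file; override with HDFS_SITE=/path/to/hdfs-site.xml.
HDFS_SITE="${HDFS_SITE:-/etc/hadoop/conf/hdfs-site.xml}"

if [ -f "$HDFS_SITE" ] && has_plain_rpc_address "$HDFS_SITE"; then
    echo "bare dfs.namenode.rpc-address is set in $HDFS_SITE"
else
    echo "bare dfs.namenode.rpc-address not found in $HDFS_SITE"
fi
```

If the first message is printed, the workaround above applies to that file.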

How to remove?

Use the configs.sh utility on the Ambari Server to delete the extra property.

/var/lib/ambari-server/resources/scripts/configs.sh -u <admin.user> -p <admin.password> delete <ambari.server> <cluster.name> hdfs-site "dfs.namenode.rpc-address"

Where <admin.user> and <admin.password> are the credentials of an Ambari Administrator, <ambari.server> is the Ambari Server host, and <cluster.name> is the name of your cluster.
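
To avoid typos when filling in the placeholders, the command can be assembled and printed first, then reviewed before it is run. This is only a sketch: build_delete_cmd is a hypothetical helper, and the user, password, host, and cluster name passed to it are made-up example values. It prints the command and executes nothing.

```shell
#!/bin/sh
# Print the configs.sh delete command from the answer with the placeholders filled in.
# Arguments: $1 admin user, $2 admin password, $3 Ambari Server host, $4 cluster name.
# Nothing is executed here; copy the printed line and run it on the Ambari Server.
build_delete_cmd() {
    printf '%s -u %s -p %s delete %s %s hdfs-site "dfs.namenode.rpc-address"\n' \
        /var/lib/ambari-server/resources/scripts/configs.sh "$1" "$2" "$3" "$4"
}

# Hypothetical example values; replace them with your own.
build_delete_cmd admin secret ambari.example.com mycluster
```

Note that the printed line contains the password in clear text, so avoid leaving it in shared logs or terminal scrollback.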

Standby NN RPC latency issue