Ambari: ambari-metrics-collector service does not start on an HDP cluster

Date: 2020-06-25 10:19:36

Tags: metrics rhel ambari hdp

We are having problems with the ambari-metrics-collector service (we have an HDP cluster, version 2.6.4, with 8 nodes).

The Ambari metrics collector service either fails to start, or starts for a few seconds and then fails.


Details about the installed metrics collector packages:

rpm -qa | grep metrics
ambari-metrics-grafana-2.6.1.0-143.x86_64
ambari-metrics-monitor-2.6.1.0-143.x86_64
ambari-metrics-collector-2.5.0.3-7.x86_64
ambari-metrics-hadoop-sink-2.6.1.0-143.x86_64

All machines run RHEL 7.2.

We performed the following steps to try to resolve the problem:

1. Restart the metrics collector service

su - ams -c '/usr/sbin/ambari-metrics-collector --config /etc/ambari-metrics-collector/conf/ stop'
su - ams -c '/usr/sbin/ambari-metrics-collector --config /etc/ambari-metrics-collector/conf/ start'

or

ambari-metrics-collector stop 
ambari-metrics-collector start

2. Restart ambari-metrics-monitor on all nodes

 ambari-metrics-monitor stop
 ambari-metrics-monitor start

3. Clean up the folder /var/lib/ambari-metrics-collector/hbase-tmp/zookeeper/

mv /var/lib/ambari-metrics-collector/hbase-tmp/zookeeper/zookeeper_0 /tmp/bck/zookeeper/

and then restart the metrics collector service
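
The move in step 3 fails if the backup directory does not exist yet, and it overwrites nothing but also keeps no history between attempts. A minimal sketch of a safer variant follows; the helper name `backup_dir` and the timestamped layout are assumptions made for illustration and are not part of the steps we actually ran.

```shell
#!/bin/sh
# backup_dir is a hypothetical helper: it moves a directory into a
# timestamped backup location, creating the backup parent first so
# that mv cannot fail on a missing target directory.
backup_dir() {
    src=$1
    dest_root=$2
    [ -d "$src" ] || { echo "nothing to back up: $src" >&2; return 1; }
    dest="$dest_root/$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$dest" && mv "$src" "$dest/"
}

# Example with the paths from step 3 (run only while the collector is stopped):
# backup_dir /var/lib/ambari-metrics-collector/hbase-tmp/zookeeper/zookeeper_0 /tmp/bck/zookeeper
```

Keeping each attempt in its own timestamped directory makes it possible to restore the previous ZooKeeper data if the cleanup turns out not to help.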

4. Tune the metrics collector parameters according to https://docs.cloudera.com/HDPDocuments/Ambari-2.2.1.0/bk_ambari_reference_guide/content/_ams_general_guidelines.html

We updated the following parameters in Ambari:

metrics_collector_heap_size=1024
hbase_regionserver_heapsize=1024
hbase_master_heapsize=512
hbase_master_xmn_size=128

Current status: steps 1-4 did not help.

In the logs we can see the following:

Log file: ambari-metrics-collector.log

2020-06-25 09:06:14,474 WARN org.apache.zookeeper.ClientCnxn: Session 0x172eab71f310002 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
2020-06-25 09:06:14,575 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=master02.sys671.com:61181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /ams-hbase-unsecure/meta-region-server
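
To confirm which ZooKeeper endpoint the collector is trying to reach, the `quorum=` value can be pulled out of the log. The sketch below is only an illustration: the function name `zk_quorum_from_log` is hypothetical, and the log path in the usage comment is an assumed default location, not something verified on our cluster.

```shell
#!/bin/sh
# zk_quorum_from_log is a hypothetical helper: it reads log lines on
# stdin and prints each distinct quorum=host:port value once.
zk_quorum_from_log() {
    sed -n 's/.*quorum=\([^, ]*\).*/\1/p' | sort -u
}

# Usage (the path is an assumption; adjust to where the collector logs live):
# zk_quorum_from_log < /var/log/ambari-metrics-collector/ambari-metrics-collector.log
```

For the excerpt above this yields master02.sys671.com:61181, which is the endpoint that refuses connections.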

Log file: hbase-ams-master-master02.sys671.com.log

2020-06-25 09:38:18,799 WARN  [RS:0;master02:51842-SendThread(master02.sys671.com:61181)] zookeeper.ClientCnxn: Session 0x172ead5d73a0004 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2020-06-25 09:38:20,437 INFO  [main-SendThread(master02.sys671.com:61181)] zookeeper.ClientCnxn: Opening socket connection to server master02.sys671.com/23.2.35.171:61181. Will not attempt to authenticate using SASL (unknown error)
2020-06-25 09:38:20,438 WARN  [main-SendThread(master02.sys671.com:61181)] zookeeper.ClientCnxn: Session 0x172ead5d73a0002 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused

We also do not see the collector web port listening (timeline.metrics.service.webapp.address):

netstat -tulpn  | grep  6188
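
Since the logs show connections to port 61181 being refused, it may be worth checking that port together with 6188. The following is a rough sketch, assuming `netstat -tln` output with the local address in the fourth column; the helper name `is_listening` is made up for this example.

```shell
#!/bin/sh
# is_listening is a hypothetical helper: it returns 0 if the given port
# appears as a local TCP address in `netstat -tln`-style output on stdin
# (the local address is expected in column 4).
is_listening() {
    awk -v port="$1" '
        $1 ~ /^tcp/ && $4 ~ (":" port "$") { found = 1 }
        END { exit found ? 0 : 1 }
    '
}

# 6188:  collector web port (timeline.metrics.service.webapp.address)
# 61181: ZooKeeper client port taken from the log excerpts above
for port in 6188 61181; do
    if netstat -tln 2>/dev/null | is_listening "$port"; then
        echo "port $port: listening"
    else
        echo "port $port: NOT listening"
    fi
done
```

Matching on the exact `:port` suffix avoids the false positives that a plain `grep 6188` can produce for ports such as 16188.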

Any suggestions on how to proceed from this point?

We would appreciate any help with this problem.

0 answers:

No answers yet