Question

I have started seeing this issue in the last couple of days. The Ganglia gmetad process gets terminated with SIGSEGV (segfault) within 5 minutes of starting.

It had been stable for the last few months, so I'm not sure what changed.

Version - gmetad 3.7.1

I don't see any core dump or anything specific to gmetad in /var/log/messages or /var/log/secure either.
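One common reason for a missing core file is a core size limit of 0 on the crashing process, and a child normally inherits that limit from its parent (here that would be supervisord). The following is a minimal Python sketch for checking it, assuming a Linux /proc filesystem. The helper names and the sample text are illustrative and not part of Ganglia or supervisor.

```python
def core_limit(limits_text):
    """Return (soft, hard) core file size limits from /proc/<pid>/limits text."""
    prefix = "Max core file size"
    for line in limits_text.splitlines():
        if line.startswith(prefix):
            # Columns after the limit name: soft limit, hard limit, units
            fields = line[len(prefix):].split()
            return fields[0], fields[1]
    return None


def read_core_limit(pid):
    """Read the core limit of a running process (Linux only)."""
    with open(f"/proc/{pid}/limits") as fh:
        return core_limit(fh.read())


# Illustrative sample; a real limits file contains many more rows.
sample = (
    "Limit                     Soft Limit           Hard Limit           Units\n"
    "Max core file size        0                    unlimited            bytes\n"
)
print(core_limit(sample))  # ('0', 'unlimited')
```

Calling read_core_limit() with the pid of supervisord or of a freshly spawned gmetad shows the limit that actually applies. A soft limit of 0 would be consistent with no core file being written, although other settings can also suppress core dumps.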

System snap (from top) at the time of this event

load average: 1.97, 0.99, 0.42

Memory also looks fairly OK:

 free -m
             total       used       free     shared    buffers     cached
Mem:          7989       3624       4364          0        333       2562
-/+ buffers/cache:        728       7260
Swap:         4095          0       4095

I have a supervisord process that forks and watches gmetad.

Here is the supervisor log:

2016-10-20 14:34:55,707 INFO exited: gmetad (terminated by SIGSEGV; not expected)
2016-10-20 14:34:55,707 INFO received SIGCLD indicating a child quit
2016-10-20 14:34:57,712 INFO spawned: 'gmetad' with pid 24561
2016-10-20 14:34:59,929 INFO exited: gmetad (terminated by SIGSEGV; not expected)
2016-10-20 14:34:59,929 INFO received SIGCLD indicating a child quit
2016-10-20 14:35:02,932 INFO spawned: 'gmetad' with pid 24593
2016-10-20 14:35:04,897 INFO exited: gmetad (terminated by SIGSEGV; not expected)
2016-10-20 14:35:04,897 INFO received SIGCLD indicating a child quit
2016-10-20 14:35:08,903 INFO spawned: 'gmetad' with pid 24618
2016-10-20 14:35:11,257 INFO exited: gmetad (terminated by SIGSEGV; not expected)
2016-10-20 14:35:11,257 INFO received SIGCLD indicating a child quit
2016-10-20 14:35:12,257 INFO gave up: gmetad entered FATAL state, too many start retries too quickly
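To see how long each gmetad instance actually survived, the spawn and exit timestamps in the log above can be paired up. This is a small Python sketch that assumes the supervisor log format shown in this post; the function name and regular expression are my own.

```python
import re
from datetime import datetime

# Matches lines such as:
#   2016-10-20 14:34:57,712 INFO spawned: 'gmetad' with pid 24561
#   2016-10-20 14:34:59,929 INFO exited: gmetad (terminated by SIGSEGV; ...)
LINE = re.compile(r"^(\S+ \S+) INFO (spawned|exited): '?(\w+)'?")


def lifetimes(log_text, name="gmetad"):
    """Seconds between each 'spawned' line and the following 'exited' line."""
    started, result = None, []
    for line in log_text.splitlines():
        m = LINE.match(line)
        if not m or m.group(3) != name:
            continue
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
        if m.group(2) == "spawned":
            started = ts
        elif started is not None:
            # An exit with no recorded spawn before it is skipped
            result.append((ts - started).total_seconds())
            started = None
    return result


sample_log = """\
2016-10-20 14:34:57,712 INFO spawned: 'gmetad' with pid 24561
2016-10-20 14:34:59,929 INFO exited: gmetad (terminated by SIGSEGV; not expected)
2016-10-20 14:34:59,929 INFO received SIGCLD indicating a child quit
2016-10-20 14:35:02,932 INFO spawned: 'gmetad' with pid 24593
2016-10-20 14:35:04,897 INFO exited: gmetad (terminated by SIGSEGV; not expected)
"""
print(lifetimes(sample_log))  # [2.217, 1.965]
```

For the excerpt above, each instance lived for about two seconds.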

Has anyone faced this kind of issue with gmetad in particular? Appreciate any pointers.

Answer 1

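I was able to identify the problem and resolve it.

Some key steps/findings -

Change 'debug_level' to > 1 in gmetad.conf. This runs gmetad in the foreground and spits out verbose logs.
I found that the gmetad process was getting killed at exactly the same point every time - when it tried to process a file for a specific node of a specific data_source.
You can comment out all the other 'data_source' lines in gmetad.conf to isolate which data_source -> node has the problem.
After figuring out the problematic node, I just deleted /path/to/rrd/node_dir/file_with_issue, or the entire dir itself. (I need to find a better way, since this loses data.)
Change debug_level back and restart gmetad!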
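As a possible shortcut for the isolation step, the RRD files can be checked one by one with `rrdtool info`, flagging any file the tool cannot read. This is only a heuristic sketch in Python: it assumes rrdtool is on the PATH, the function name is made up, and a file that crashes gmetad could still pass this check.

```python
import subprocess
from pathlib import Path


def find_unreadable_rrds(rrd_root, check_cmd=("rrdtool", "info")):
    """Return .rrd files under rrd_root for which check_cmd exits non-zero.

    check_cmd is run once per file with the file path appended. A crash of
    the checker itself (negative return code) also counts as a failure.
    """
    bad = []
    for path in sorted(Path(rrd_root).rglob("*.rrd")):
        proc = subprocess.run(
            [*check_cmd, str(path)],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if proc.returncode != 0:
            bad.append(path)
    return bad
```

Usage would be something like find_unreadable_rrds("/path/to/ganglia/rrds"), using the placeholder path from this answer. An empty result does not prove the files are fine, so the debug_level approach is still the more reliable one.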

In my case, to point out one file name - 'part_max_used.rrd' under /path/to/ganglia/rrds/node_name was the root cause of the SIGSEGV.
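Instead of deleting the offending file outright, one option is to move it aside so it can still be inspected later. A minimal Python sketch, where the function name and the quarantine directory are hypothetical:

```python
import shutil
from pathlib import Path


def quarantine(rrd_file, quarantine_dir):
    """Move rrd_file into quarantine_dir/<node directory name>/ and return the new path."""
    src = Path(rrd_file)
    # Keep the node directory name so files from different nodes do not collide
    dest_dir = Path(quarantine_dir) / src.parent.name
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.name
    shutil.move(str(src), str(dest))
    return dest
```

From gmetad's point of view this has the same effect as deleting the file, so the history in it is no longer graphed, but the original file is kept rather than lost.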

Hope this helps :-)

Ganglia - gmetad - process is getting terminated by SIGSEGV
