Ganglia - gmetad - process is getting terminated by SIGSEGV

Time: 2016-10-20 18:46:26

Tags: sigsegv ganglia gmetad

I started seeing this issue in the last couple of days. The Ganglia gmetad process is terminated by SIGSEGV (segfault) within 5 minutes of starting.

It had been stable for the last few months, so I am not sure what changed.

Version - gmetad 3.7.1

I don't see any core dump or anything specific to gmetad in /var/log/messages or /var/log/secure either.

System snapshot (from top) at the time of the event:

load average: 1.97, 0.99, 0.42

Memory also looks fairly OK:

 free -m
             total       used       free     shared    buffers     cached
Mem:          7989       3624       4364          0        333       2562
-/+ buffers/cache:        728       7260
Swap:         4095          0       4095

I have a supervisord process that forks and watches gmetad.

Here is the supervisor log:

2016-10-20 14:34:55,707 INFO exited: gmetad (terminated by SIGSEGV; not expected)
2016-10-20 14:34:55,707 INFO received SIGCLD indicating a child quit
2016-10-20 14:34:57,712 INFO spawned: 'gmetad' with pid 24561
2016-10-20 14:34:59,929 INFO exited: gmetad (terminated by SIGSEGV; not expected)
2016-10-20 14:34:59,929 INFO received SIGCLD indicating a child quit
2016-10-20 14:35:02,932 INFO spawned: 'gmetad' with pid 24593
2016-10-20 14:35:04,897 INFO exited: gmetad (terminated by SIGSEGV; not expected)
2016-10-20 14:35:04,897 INFO received SIGCLD indicating a child quit
2016-10-20 14:35:08,903 INFO spawned: 'gmetad' with pid 24618
2016-10-20 14:35:11,257 INFO exited: gmetad (terminated by SIGSEGV; not expected)
2016-10-20 14:35:11,257 INFO received SIGCLD indicating a child quit
2016-10-20 14:35:12,257 INFO gave up: gmetad entered FATAL state, too many start retries too quickly
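
To see how long each gmetad instance survives, the spawn and exit timestamps in a log like the one above can be paired up. The following is a minimal sketch, not part of the original post; the function name `gmetad_lifetimes` is hypothetical, and it assumes the log lines have exactly the `timestamp INFO message` layout shown here.

```python
from datetime import datetime

# Timestamp layout as it appears in the log above, e.g. 2016-10-20 14:34:57,712
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def gmetad_lifetimes(log_lines, program="gmetad"):
    """Return the seconds between each 'spawned' line and the next 'exited' line."""
    lifetimes = []
    spawned_at = None
    for line in log_lines:
        parts = line.split(" INFO ", 1)
        if len(parts) != 2:
            continue  # not in the assumed "timestamp INFO message" layout
        stamp, message = parts
        try:
            when = datetime.strptime(stamp.strip(), TS_FORMAT)
        except ValueError:
            continue  # timestamp did not parse; skip the line
        if message.startswith(f"spawned: '{program}'"):
            spawned_at = when
        elif message.startswith(f"exited: {program} ") and spawned_at is not None:
            lifetimes.append((when - spawned_at).total_seconds())
            spawned_at = None  # wait for the next spawn before pairing again
    return lifetimes
```

Applied to the excerpt above, each spawned instance exits roughly two seconds after it starts, which suggests the crash happens early in startup rather than after minutes of normal operation.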

Has anyone faced this kind of issue with gmetad in particular? Appreciate any pointers.

1 answer:

Answer 0: (score: 0)

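I was able to identify the problem and resolve it.

Some key steps/findings -

  1. Set 'debug_level' to a value greater than 1 in gmetad.conf; this runs gmetad in the foreground and makes it print verbose logs.
  2. I found that the gmetad process was getting killed at exactly the same point every time: when it tried to process a file for a particular node of a specific data_source.
  3. You can comment out all the other 'data_source' entries in gmetad.conf to isolate which data_source -> node is the problem.
  4. After figuring out the node in question, I simply deleted /path/to/rrd/node_dir/file_with_issue, or the entire directory. (A better way is needed, since this loses data.)
  5. Change debug_level back and restart gmetad!
  6. In my case, to name the specific file: 'part_max_used.rrd' under /path/to/ganglia/rrds/node_name was the root cause of the SIGSEGV.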
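
Steps 3 and 4 can be scripted. The sketch below is only an illustration under stated assumptions: the helper names `isolate_data_source` and `quarantine_rrd` are hypothetical, data_source names are assumed to be written in double quotes in gmetad.conf, and the suspect file is moved to a quarantine directory instead of being deleted so that it can still be inspected or restored later.

```python
import shutil
import time
from pathlib import Path


def isolate_data_source(conf_text, keep_name):
    """Comment out every data_source line except the one named keep_name.

    Assumes names are double-quoted, e.g.: data_source "my cluster" host:8649
    """
    out = []
    for line in conf_text.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("data_source") and f'"{keep_name}"' not in stripped:
            out.append("# " + line)  # disable this source for the test run
        else:
            out.append(line)  # keep everything else untouched
    return "\n".join(out) + "\n"


def quarantine_rrd(rrd_path, quarantine_dir):
    """Move a suspect .rrd file aside instead of deleting it; return the new path."""
    src = Path(rrd_path)
    dest_dir = Path(quarantine_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Prefix with the node directory name and add a timestamp to avoid collisions.
    dest = dest_dir / f"{src.parent.name}__{src.name}.{int(time.time())}"
    shutil.move(str(src), str(dest))
    return dest
```

For example, `quarantine_rrd("/path/to/ganglia/rrds/node_name/part_max_used.rrd", "/path/to/quarantine")` would move the file from step 6 out of the rrds tree; the quarantine path here is a placeholder. It is probably safest to do this while gmetad is stopped, and to write the output of `isolate_data_source` to a copy of gmetad.conf first.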

    Hope this helps :-)