Logstash消耗大量的CPU

时间:2016-05-04 07:22:35

标签: elasticsearch logstash kafka-consumer-api logstash-grok logstash-configuration

详细说明:

  • dockerized environment,isolated,only ELK
  • logstash版本2。3。1(2016年4月4日)
  • logstash-输入卡夫卡-2.0.6
  • elasticsearch output
  • 200记录/秒
  • 8核,8GB RAM,2GB logstash堆(cpu消​​耗(“ni”)在所有8个核心上达到100%)
  • 除了elasticsearch之外没有其他进程消耗超过1%的cpu / mem
  • 经过一段时间(通常是几个小时)后,logstash停止向ES发送数据
  • logstash日志中没有任何内容
  • 当logstash服务重新启动时,一切正常工作

logstash进程:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
systemd+ 15395  457  5.3 6953124 424628 ?      SNl  08:57  63:19 /usr/bin/java -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -Djava.io.tmpdir=/opt/logstash -Xmx2g -Xss2048k -Djffi.boot.library.path=/opt/logstash/vendor/jruby/lib/jni -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -Djava.io.tmpdir=/opt/logstash -XX:HeapDumpPath=/opt/logstash/heapdump.hprof -Xbootclasspath/a:/opt/logstash/vendor/jruby/lib/jruby.jar -classpath : -Djruby.home=/opt/logstash/vendor/jruby -Djruby.lib=/opt/logstash/vendor/jruby/lib -Djruby.script=jruby -Djruby.shell=/bin/sh org.jruby.Main --1.9 /opt/logstash/lib/bootstrap/environment.rb logstash/runner.rb agent -f /etc/logstash/conf.d -l /var/log/logstash/logstash.log --auto-reload

Logstash子进程(top -Hp)

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                                                                                               
 4637 systemd+  39  19 7045796 2,201g   5728 R 97,7 28,8 292:53.17 [main]>worker7                                                                                        
 4632 systemd+  39  19 7045796 2,201g   5728 R 97,3 28,8 140:04.15 [main]>worker2                                                                                        
 4635 systemd+  39  19 7045796 2,201g   5728 R 97,3 28,8 223:43.32 [main]>worker5                                                                                        
 4630 systemd+  39  19 7045796 2,201g   5728 R 97,0 28,8 281:26.59 [main]>worker0                                                                                        
 4631 systemd+  39  19 7045796 2,201g   5728 R 96,3 28,8 160:18.87 [main]>worker1                                                                                        
 4636 systemd+  39  19 7045796 2,201g   5728 R 96,3 28,8 134:52.64 [main]>worker6                                                                                        
 4517 systemd+  39  19 7045796 2,201g   5728 R 79,7 28,8 183:59.43 LogStash::Runne                                                                                       
 4493 systemd+  39  19 7045796 2,201g   5728 S  1,7 28,8  33:23.33 java                                                                                                  
 4495 systemd+  39  19 7045796 2,201g   5728 S  1,7 28,8  33:18.66 java                                                                                                  
 4497 systemd+  39  19 7045796 2,201g   5728 S  1,7 28,8  33:22.12 java                                                                                                  
 4498 systemd+  39  19 7045796 2,201g   5728 S  1,7 28,8  33:21.07 java                                                                                                  
 4499 systemd+  39  19 7045796 2,201g   5728 S  1,7 28,8  33:22.47 java                                                                                                  
 4494 systemd+  39  19 7045796 2,201g   5728 S  1,3 28,8  33:23.83 java                                                                                                  
 4496 systemd+  39  19 7045796 2,201g   5728 S  0,7 28,8  33:22.12 java                                                                                                  
 4500 systemd+  39  19 7045796 2,201g   5728 S  0,7 28,8  33:22.44 java                                                                                                  
 4633 systemd+  39  19 7045796 2,201g   5728 S  0,7 28,8   3:22.78 [main]>worker3                                                                                        
20526 systemd+  39  19 7045796 2,201g   5728 S  0,7 28,8   0:15.27 [main]<kafka                                                                                          
 4492 systemd+  39  19 7045796 2,201g   5728 S  0,3 28,8   0:23.78 LogStash::Runne                                                                                       
 4532 systemd+  39  19 7045796 2,201g   5728 S  0,3 28,8   0:55.25 [main]<kafka                                                                                          
 4536 systemd+  39  19 7045796 2,201g   5728 S  0,3 28,8   1:29.43 [main]<kafka                                                                                          
 4606 systemd+  39  19 7045796 2,201g   5728 S  0,3 28,8   0:03.88 [main]<kafka                                                                                          
 4638 systemd+  39  19 7045796 2,201g   5728 S  0,3 28,8   0:48.34 pipeline.main                                                                                         
20529 systemd+  39  19 7045796 2,201g   5728 S  0,3 28,8   0:14.41 [main]<kafka                                                                                          
20533 systemd+  39  19 7045796 2,201g   5728 S  0,3 28,8   0:15.62 [main]<kafka           

Logstash配置

像这样的几个街区:

kafka {
    zk_connect => "pbdcnt1:2181,pbdcnt2:2181,pbdcnt3:2181"
    topic_id => "transactions"
    tags => ["transactions"]
    group_id => "prod_elastic_fraud"
    codec => plain {
        charset => "Windows-1250"
    }
}

有几个像这样:

filter {
    if "transactions" in [tags] {
        grok{
            match=>["message","(?<log_time>.*?)? ?(?<src>.*?)? ?(?<insert_time>.*?);(?<col1>.*?);(?<col2>.*?);...;(?<colN>.*?)"]
        }
    }

输出是这样的:

output {
  if [tags] {
    elasticsearch {
      hosts => ["localhost"]
      index => "%{tags}-%{+YYYY.MM.dd}"
    }
  }
}

某处我的错? 如果需要更多数据,请询问。

谢谢, 埃迪

更新:

在某处泄漏?垃圾收集时间不断增加(jstat)

S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT    LGCC                 GCC  

64.69   0.00  38.41  13.47  59.96  57906 2977.233  4916  206.461 3183.694 Allocation Failure   No GC               
  0.00  51.04  27.61  13.48  59.96  57907 2977.291  4916  206.461 3183.753 Allocation Failure   No GC               
 38.59  51.04 100.00  13.48  59.96  57908 2977.291  4916  206.461 3183.753 Allocation Failure   Allocation Failure  
 57.04   0.00  77.47  13.48  59.96  57908 2977.342  4916  206.461 3183.803 Allocation Failure   No GC               
  0.00  72.64  81.57  13.48  59.96  57909 2977.399  4916  206.461 3183.860 Allocation Failure   No GC               
 61.65   0.00  34.59  13.48  59.96  57910 2977.448  4916  206.461 3183.909 Allocation Failure   No GC               
  0.00  45.05   8.99  13.48  59.96  57911 2977.492  4916  206.461 3183.953 Allocation Failure   No GC               
 48.47   0.00   7.30  13.49  59.96  57912 2977.539  4916  206.461 3184.000 Allocation Failure   No GC               
 48.47   0.00  72.43  13.49  59.96  57912 2977.539  4916  206.461 3184.000 Allocation Failure   No GC               
  0.00  53.80  43.27  13.49  59.96  57913 2977.588  4916  206.461 3184.050 Allocation Failure   No GC               
 59.96   0.00  49.70  13.49  59.96  57914 2977.636  4916  206.461 3184.098 Allocation Failure   No GC               
 59.96   0.00  95.84  13.49  59.96  57914 2977.636  4916  206.461 3184.098 Allocation Failure   No GC               
  0.00  54.05  84.24  13.49  59.96  57915 2977.683  4916  206.461 3184.145 Allocation Failure   No GC               
 65.37   0.00  84.28  13.49  59.96  57916 2977.743  4916  206.461 3184.205 Allocation Failure   No GC               
  0.00  48.62  48.73  13.49  59.96  57917 2977.798  4916  206.461 3184.259 Allocation Failure   No GC               
 53.00   0.00  24.35  13.50  59.96  57918 2977.845  4916  206.461 3184.307 Allocation Failure   No GC               
  0.00  52.87  19.08  13.50  59.96  57919 2977.900  4916  206.461 3184.361 Allocation Failure   No GC               
  0.00  52.87  84.52  13.50  59.96  57919 2977.900  4916  206.461 3184.361 Allocation Failure   No GC               
 72.04   0.00  60.68  13.50  59.96  57920 2977.954  4916  206.461 3184.415 Allocation Failure   No GC               
  0.00  65.02  64.85  13.51  59.96  57921 2978.009  4916  206.461 3184.471 Allocation Failure   No GC               
 58.54   0.00  40.52  13.51  59.96  57922 2978.058  4916  206.461 3184.520 Allocation Failure   No GC               
 58.54   0.00  98.01  13.51  59.96  57922 2978.058  4916  206.461 3184.520 Allocation Failure   No GC               
  0.00  58.06  99.74  13.51  59.96  57924 2978.107  4916  206.461 3184.569 Allocation Failure   Allocation Failure  
 69.63   0.00  81.55  13.51  59.96  57924 2978.163  4916  206.461 3184.624 Allocation Failure   No GC               
  0.00  56.36  38.21  13.51  59.96  57925 2978.211  4916  206.461 3184.672 Allocation Failure   No GC               
 52.32   0.00  46.66  13.51  59.96  57926 2978.260  4916  206.461 3184.721 Allocation Failure   No GC               
  0.00  67.84  18.10  13.51  59.96  57927 2978.312  4916  206.461 3184.774 Allocation Failure   No GC               

进程的jstack: https://www.dropbox.com/s/u6tz0w4v0in8lev/logstash.jstack?dl=0

1 个答案:

答案 0 :(得分:0)

可悲的是,正是消耗太多CPU的正则表达式。当我删除正则表达式并使用带有列选项的CSV过滤器时(这基本上就是我想要做的事情),CPU现在一直在150%左右(对于200次事务来说似乎也太高了)。但是,嘿,这就是目前的情况。

什么保持迷茫是为什么它有效一段时间然后停止。也许会增加工人......