Storm - Supervisor crashes on reboot

Time: 2014-03-11 07:13:43

Tags: apache-storm

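This problem is driving me nuts. I am running a single-machine Storm instance on my local LAN. I am currently running the v0.9.1-incubating release build (from the Apache Incubator site). The problem is simply that my storm supervisor process refuses to start on EVERY SINGLE reboot. The hack fix is pretty simple: delete the supervisor and workers folders from the Storm local directory and rerun the process. It then runs hunky-dory until the next reboot.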

I have included some information that I think may be relevant for debugging this. Please ask for more if needed, but do help me solve this.

PS: It does not matter whether I am running topologies or not.

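  1. Zookeeper version: 3.4.5
  2. Storm version: 0.9.1-incubating (using Netty transport)
  3. Storm and Zookeeper both run on the same machine.
  4. supervisord version: 3.0b2
  5. OS: Ubuntu 12.04 LTS
  6. Processor: AMD Phenom(tm) II X6 1055T Processor × 6
  7. RAM: 5.6 GiB
  8. Supervisor (supervisord) config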

    [program:zookeeper]
    command=/path/to/zookeeper/bin/zkServer.sh "start-foreground"
    process_name=zookeeper
    directory=/path/to/zookeeper/bin
    stdout_logfile=/var/log/zookeeper.log        ; stdout log path, NONE$
    stderr_logfile=/var/log/err.zookeeper.log        ; stderr log path, $
    priority=2
    user=root
    
    
    [program:storm-nimbus]
    command=/path/to/storm/bin/storm nimbus
    user=root
    autostart=true
    autorestart=true
    startsecs=10
    startretries=2
    log_stdout=true
    log_stderr=true
    stderr_logfile=/var/log/storm/nimbus.err.log
    stdout_logfile=/var/log/storm/nimbus.out.log
    logfile_maxbytes=20MB
    logfile_backups=2
    priority=10
    
    
    [program:storm-ui]
    command=/path/to/storm/bin/storm ui
    user=root
    autostart=true
    autorestart=true
    startsecs=10
    startretries=2
    log_stdout=true
    log_stderr=true
    stderr_logfile=/var/log/storm/ui.err.log
    stdout_logfile=/var/log/storm/ui.out.log
    logfile_maxbytes=20MB
    logfile_backups=2
    priority=500
    
    
    [program:storm-supervisor]
    command=/path/to/storm/bin/storm supervisor
    user=root
    autostart=true
    autorestart=true
    startsecs=10
    startretries=2
    log_stdout=true
    log_stderr=true
    stderr_logfile=/var/log/storm/supervisor.err.log
    stdout_logfile=/var/log/storm/supervisor.log.log
    logfile_maxbytes=20MB
    logfile_backups=2
    priority=600
    
    
    [program:storm-logviewer]
    command=/path/to/storm/bin/storm logviewer
    user=root
    autostart=true
    autorestart=true
    startsecs=10
    startretries=2
    log_stdout=true
    log_stderr=true
    stderr_logfile=/var/log/storm/log.err.log
    stdout_logfile=/var/log/storm/log.out.log
    logfile_maxbytes=20MB
    logfile_backups=2
    priority=900
    

  9. Storm configuration

    #Zookeeper
    storm.zookeeper.servers:
         - "192.168.1.11"
    
    # Nimbus
    nimbus.host: "192.168.1.11"
    nimbus.childopts: '-Xmx1024m -Djava.net.preferIPv4Stack=true -Dprocess=storm'
    
    # UI
    ui.port: 9090
    ui.childopts: "-Xmx768m -Djava.net.preferIPv4Stack=true -Dprocess=storm"
    
    # Supervisor
    supervisor.childopts: '-Djava.net.preferIPv4Stack=true -Dprocess=storm'
    
    
    # Worker
    worker.childopts: '-Xmx768m -Djava.net.preferIPv4Stack=true -Dprocess=storm'
    
    storm.local.dir: "/path/to/storm"
    
    storm.messaging.transport: "backtype.storm.messaging.netty.Context"
    storm.messaging.netty.server_worker_threads: 1
    storm.messaging.netty.client_worker_threads: 1
    storm.messaging.netty.buffer_size: 5242880
    storm.messaging.netty.max_retries: 100
    storm.messaging.netty.max_wait_ms: 1000
    storm.messaging.netty.min_wait_ms: 100
    

  10. Error message

    Pastebin for log error message. I am cross-posting the relevant bits here.

    java.lang.RuntimeException: java.io.EOFException
        at backtype.storm.utils.Utils.deserialize(Utils.java:86) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
        at backtype.storm.utils.LocalState.snapshot(LocalState.java:45) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
        at backtype.storm.utils.LocalState.get(LocalState.java:56) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
        at backtype.storm.daemon.supervisor$sync_processes.invoke(supervisor.clj:207) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
        at clojure.lang.AFn.applyToHelper(AFn.java:161) [clojure-1.4.0.jar:na]
        at clojure.lang.AFn.applyTo(AFn.java:151) [clojure-1.4.0.jar:na]
        at clojure.core$apply.invoke(core.clj:603) ~[clojure-1.4.0.jar:na]
        at clojure.core$partial$fn__4070.doInvoke(core.clj:2343) ~[clojure-1.4.0.jar:na]
        at clojure.lang.RestFn.invoke(RestFn.java:397) ~[clojure-1.4.0.jar:na]
        at backtype.storm.event$event_manager$fn__2593.invoke(event.clj:39) ~[na:na]
        at clojure.lang.AFn.run(AFn.java:24) [clojure-1.4.0.jar:na]
        at java.lang.Thread.run(Thread.java:679) [na:1.6.0_27]
    Caused by: java.io.EOFException: null
        at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2322) ~[na:1.6.0_27]
        at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2791) ~[na:1.6.0_27]
        at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:798) ~[na:1.6.0_27]
        at java.io.ObjectInputStream.<init>(ObjectInputStream.java:298) ~[na:1.6.0_27]
        at backtype.storm.utils.Utils.deserialize(Utils.java:81) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
        ... 11 common frames omitted
    2014-03-11 12:27:25 b.s.util [INFO] Halting process: ("Error when processing an event")
    

2 Answers:

Answer 0 (score: 5)

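We ran into exactly the same problem (the supervisor crashing on startup with the same error message in the log) when two of our development servers lost power. I suppose that simply shutting the server down without stopping the supervisor first would have the same effect.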

The only workable solution we found was to delete the "storm-local/supervisor" folder (I guess its contents had been corrupted).
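
One way to check the corruption guess before deleting anything: the EOFException in the question's stack trace is thrown from ObjectInputStream.readStreamHeader, which suggests the data being deserialized ended before the 4-byte Java serialization header could be read. A minimal Python sketch that lists files shorter than 4 bytes under a directory follows. The function name find_truncated_files is hypothetical, and some short files may be legitimate (for example empty marker files), so treat the output as candidates only.

```python
import os

# A Java serialization stream starts with a 4-byte header
# (2-byte magic followed by a 2-byte version).
JAVA_STREAM_HEADER_BYTES = 4


def find_truncated_files(root, min_bytes=JAVA_STREAM_HEADER_BYTES):
    """Return a sorted list of files under root that are smaller than min_bytes."""
    suspects = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) < min_bytes:
                suspects.append(path)
    return sorted(suspects)


if __name__ == "__main__":
    import sys

    # Usage: python find_truncated.py <directory>
    if len(sys.argv) > 1:
        for suspect in find_truncated_files(sys.argv[1]):
            print(suspect, os.path.getsize(suspect))
```

Pointing it at the supervisor folder under storm.local.dir (a placeholder path such as /path/to/storm/supervisor, following the question's config) after a bad reboot should show whether any state file was left empty.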

Answer 1 (score: 1)

I have run into a similar problem too. I used to always delete the local folder and restart the topologies.
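
For anyone who wants to script that cleanup, here is a minimal Python sketch of the workaround described in this thread. It assumes the supervisor is stopped while it runs and that the state lives in supervisor and workers subfolders of storm.local.dir, as the question describes. The function name clear_local_state is hypothetical. This only automates the workaround and does not address why the state gets corrupted.

```python
import os
import shutil


def clear_local_state(storm_local_dir, subdirs=("supervisor", "workers")):
    """Delete the named subfolders of storm_local_dir.

    Returns the list of paths that were actually removed. Folders that
    do not exist are skipped, and anything else in storm_local_dir is
    left untouched.
    """
    removed = []
    for name in subdirs:
        path = os.path.join(storm_local_dir, name)
        if os.path.isdir(path):
            shutil.rmtree(path)
            removed.append(path)
    return removed
```

Calling clear_local_state("/path/to/storm") (the placeholder storm.local.dir from the question) before starting the supervisor would reproduce the manual fix. Note that it discards the supervisor's local state, so running workers should not be relied on to survive it.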