Shell script hangs, but only when its output is captured in a variable or it is run under strace

Time: 2012-12-08 21:19:10

Tags: linux bash shell

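General question: what could cause a script that works fine on its own to hang when the script or shell (bash) command that calls it captures its output in a variable?

In other words, how can a script that works when called like this... /path/to/script arg arg ...fail and hang when called like this... VAR=$(/path/to/script arg arg);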


(Major edit after noticing that a software glitch had caused much of the initial testing to give faulty results.)


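My specific case: my script, which starts, stops, or restarts the Java application Apache Solr (adapted from here), works fine on its own. The code is below; it is invoked as sbin/service solr [action], for example sbin/service solr start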

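When called from a script or directly from the console (bash in my case) as sbin/service solr start, it works and completes quickly. However, when its output is captured in a variable, as in VAR=$(sbin/service solr start);, the service starts but the call then hangs in a futex / clock_gettime loop (trace below). It also hangs if, instead of being captured in a variable, it is run under strace.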

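Oddly, other scripts called the same way with the same syntax, such as sbin/service httpd start, work fine when their output is captured in a variable. So evidently something in a script can make it hang when its output is stored in a variable, even though it runs perfectly well otherwise.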


Here are the results of testing which calls hang and which do not:

Hangs ------------------------------------------------

  • VAR=$(/sbin/service solr start);
  • VAR=$(source /sbin/service solr start);
  • VAR=$(nohup /sbin/service solr start &);

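(So the way it is invoked makes no difference.) Also, editing the script file to launch the service with source stops the service from working.

Does not hang ----------------------------------------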

  • VAR=$(/sbin/service solr start >> /dev/null);

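Redirecting output to /dev/null lets us make the call inside a command substitution without it hanging. It is not much use, though, because no actual output is received.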

  • /sbin/service solr start

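Contrary to what I originally thought. This prints a simple status message, which ideally we would capture in a variable and log, but trying to do so makes it hang.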

  • VAR=$(/sbin/service httpd restart);

The syntax that hangs works fine with other service scripts, and the script's output is passed to the variable without problems.


Here is the full code of the script (comments removed, and naturally the $SOLR_DIR path is a real path in the real script):

SOLR_DIR="[path/to/application]"
JAVA_OPTIONS="-Xms64m -Xmx64m -DSTOP.PORT=8079 -DSTOP.KEY=mustard -jar start.jar"
LOG_FILE="/var/log/solr.log"
JAVA="/usr/bin/java"

case $1 in
    start)
        echo "Starting Solr"
        cd $SOLR_DIR
        $JAVA $JAVA_OPTIONS 2> $LOG_FILE &
        ;;
    stop)
        echo "Stopping Solr"
        cd $SOLR_DIR
        $JAVA $JAVA_OPTIONS --stop
        ;;
    restart)
        $0 stop
        sleep 1
        $0 start
        ;;
    *)
        echo "Usage: $0 {start|stop|restart}" >&2
        exit 1
        ;;
esac

/var/log/solr.log (the log file specified in the script) contains no errors or anything unusual. The server runs CentOS Linux, in case that is relevant.


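In response to an earlier version of the question, @cdarke suggested I run strace -f -o strace.out /path/to/script on the script that calls this script and look through the (massive!) output file strace.out. It is nearly 3 MB; here are some observations: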

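  1. It starts with a lot of activity that looks like the script running as expected.

  2. Then the last 15% or so of the log file is this, repeated with different integers but seemingly the same hex codes:

    ...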
    25687 futex(0x688d454, FUTEX_WAIT_PRIVATE, 1, {0, 49980000}) = -1 ETIMEDOUT (Connection timed out)
    25687 futex(0x688d428, FUTEX_WAKE_PRIVATE, 1) = 0
    25687 clock_gettime(CLOCK_MONOTONIC, {39074112, 932735888}) = 0
    25687 clock_gettime(CLOCK_REALTIME, {1355007234, 333458000}) = 0
    

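    These PIDs show nothing when looked up with ps -p, even when I do so while the script is still running, the output file is still growing, and these lines are still being written. I am not sure how that is possible.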

    This is the last bit of output before it goes into the never-ending futex / clock_gettime loop, and the last part that is clearly the script executing correctly (solr/solr.xml is a Solr configuration file that has to be read to launch the Solr process):

    25874 stat("solr/solr.xml", {st_mode=S_IFREG|0777, st_size=1320, ...}) = 0
    25874 write(2, "Dec 8, 2012 5:12:05 PM org.apach"..., 106) = 106
    25874 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 89
    25874 fcntl(89, F_GETFL) = 0x2 (flags O_RDWR)
    25874 fcntl(89, F_SETFL, O_RDWR|O_NONBLOCK) = 0
    25874 setsockopt(89, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    25874 bind(89, {sa_family=AF_INET, sin_port=htons(8983), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
    25874 listen(89, 50) = 0
    25874 setsockopt(89, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    25874 lseek(12, 57747, SEEK_SET) = 57747
    25874 read(12, "PK\3\4\n\0\0\0\10\0\221Vi>F\347\254\364\325\4\0\0002\t\0\0002\0\0\0", 30) = 30
    25874 lseek(12, 57827, SEEK_SET) = 57827
    25874 read(12, "\225V\377oSU\24\377\334\273\256\257_\36l\216m\254\262\351\224\241]\273\255\200\314/\5\246c\200"..., 1237) = 1237
    25874 futex(0x2aaab0173054, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x2aaab0173050, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1} <unfinished ...>
    25894 <... futex resumed> ) = 0
    25894 futex(0x2aaab0173028, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
    25874 <... futex resumed> ) = 1
    25874 futex(0x2aaab0173028, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
    25894 <... futex resumed> ) = 0
    25894 futex(0x2aaab0173028, FUTEX_WAKE_PRIVATE, 1) = 0
    25894 clock_gettime(CLOCK_REALTIME, {1355008325, 376033000}) = 0
    25894 futex(0x2aaab0173054, FUTEX_WAIT_PRIVATE, 3, {0, 983000} <unfinished ...>
    25874 <... futex resumed> ) = 1
    25874 futex(0x2aaab0173054, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x2aaab0173050, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1} <unfinished ...>
    25894 <... futex resumed> ) = 0
    25894 futex(0x2aaab0173028, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
    25874 <... futex resumed> ) = 1
    25874 futex(0x2aaab0173028, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
    25894 <... futex resumed> ) = 0
    25894 futex(0x2aaab0173028, FUTEX_WAKE_PRIVATE, 1) = 0
    25894 poll([{fd=89, events=POLLIN|POLLERR}], 1, -1 <unfinished ...>
    25874 <... futex resumed> ) = 1
    25874 write(2, "2012-12-08 17:12:05.376:INFO::St"..., 66) = 66
    25874 write(2, "\n", 1) = 1
    25874 mmap(0x41348000, 12288, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x41348000
    25874 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    25874 sched_getaffinity(25874, 32, { ffff, 0, 0, 0 }) = 32
    25874 sched_getaffinity(25874, 32, { ffff, 0, 0, 0 }) = 32
    25874 gettid() = 25874
    25874 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
    25874 rt_sigprocmask(SIG_UNBLOCK, [HUP ILL BUS FPE SEGV USR2 TERM], NULL, 8) = 0
    25874 rt_sigprocmask(SIG_BLOCK, [QUIT], NULL, 8) = 0
    25874 mmap(0x41348000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x41348000
    25874 mprotect(0x41348000, 12288, PROT_NONE) = 0
    25874 futex(0x10632d54, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...>
    25882 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
    25882 futex(0x106cc428, FUTEX_WAKE_PRIVATE, 1) = 0
    25882 clock_gettime(CLOCK_MONOTONIC, {39075204, 21489888}) = 0
    25882 clock_gettime(CLOCK_REALTIME, {1355008325, 422198000}) = 0
    25882 futex(0x106cc454, FUTEX_WAIT_PRIVATE, 1, {0, 49984000}) = -1 ETIMEDOUT (Connection timed out)
    25882 futex(0x106cc428, FUTEX_WAKE_PRIVATE, 1) = 0
    25882 clock_gettime(CLOCK_MONOTONIC, {39075204, 72479888}) = 0
    25882 clock_gettime(CLOCK_REALTIME, {1355008325, 473185000}) = 0
    25882 futex(0x106cc454, FUTEX_WAIT_PRIVATE, 1, {0, 49987000}) = -1 ETIMEDOUT (Connection timed out)
    25882 futex(0x106cc428, FUTEX_WAKE_PRIVATE, 1) = 0

    So the last line before the death spiral is a read() on channel 12. After that it just loops futex and clock_gettime until it is killed manually.


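    One last point, possibly irrelevant: if, similar to in this question, I run the script that calls this script using nohup and divert the output to /dev/null, I get the following near the start (about 100 KB into the output file). Lots of these:

    25664 close(67) = -1 EBADF (Bad file descriptor)

    They start at 67 and increase by 1 each time, up to:

    25664 close(1023) = -1 EBADF (Bad file descriptor)

    They are then followed by other output (not reproduced here).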

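    Again, as far as I can see, the PIDs are empty. I am not sure whether this is relevant. I suppose it opens up the possibility that using nohup with output sent to /dev/null really is the general fix for this kind of problem, but that I am doing it wrong in some way, which causes these errors.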
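A possible explanation for the IDs that ps -p cannot find (an assumption, not something confirmed in the post): with -f, strace prefixes each line with the ID of the thread making the call, and a JVM runs many threads. ps -p matches process IDs only, so a non-main thread ID yields nothing. The sketch below, for Linux only, uses a hypothetical helper named owner_of_tid that searches /proc for the process owning a given thread ID.

```shell
#!/usr/bin/env bash
# owner_of_tid TID
# Hypothetical helper: print the PID of the process that owns thread TID.
# Relies on the Linux /proc/<pid>/task/<tid> layout.
owner_of_tid() {
    local tid=$1 dir
    for dir in /proc/[0-9]*/task/"$tid"; do
        # If nothing matched, the pattern is left as-is and is not a directory.
        [ -d "$dir" ] || continue
        dir=${dir#/proc/}      # "<pid>/task/<tid>"
        echo "${dir%%/*}"      # "<pid>"
        return 0
    done
    return 1
}

# A shell's main thread has the same ID as the process itself.
owner_of_tid "$$"
```

With an ID taken from the trace, something like ps -L -p "$(owner_of_tid 25687)" would then list the threads of the owning process while it is still running.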

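1 Answer:

Answer 0: (score: 2)

I'm fairly sure the problem is that the shell is capturing output from the /sbin/service script *and* from the solr service it launches, so it waits for the service to exit (or at least to close its stdout) before continuing. Here is a simple demo: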

$ bg_service() { while true; do sleep 10; done; }
$ start_bg_service() { echo "starting"; bg_service& echo "running"; }
$ start_bg_service 
starting
[1] 8656
running
$ var=$(start_bg_service)
[It hangs at this point... until I open another shell and kill the background process]
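
If that diagnosis is right, giving the background process its own stdout should let the command substitution finish as soon as the launcher returns. The following is a minimal sketch under that assumption; bg_service is a short-lived stand-in for the Solr JVM, and start_detached is a hypothetical name, not part of the original script.

```shell
#!/usr/bin/env bash
# Stand-in for the long-running service (here it just sleeps briefly).
bg_service() { sleep 5; }

# For contrast: this variant would block under $(...) until bg_service exits,
# because the background job inherits the capture pipe as its stdout.
start_inheriting() { echo "starting"; bg_service & }

# The background job's stdout and stderr point away from the capture pipe,
# so the pipe reaches end-of-file as soon as this function returns.
start_detached() { echo "starting"; bg_service > /dev/null 2>&1 & }

SECONDS=0
var=$(start_detached)
echo "captured '$var' after ${SECONDS}s"
```

Applied to the script in the question, the analogous change would presumably be to redirect stdout on the line that launches Java, for example $JAVA $JAVA_OPTIONS > /dev/null 2> $LOG_FILE &. That has not been tested against the asker's setup.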