Torque BLCR checkpointing with a statically linked executable

Time: 2013-10-02 15:48:28

Tags: torque blcr

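I am trying to use Berkeley Lab Checkpoint/Restart (BLCR) to checkpoint jobs that are handled by the Torque job scheduler. I get an error when I run cr_run 'my_exec', and I believe the cause is that the executable was statically linked at compile time. The submission script looks like this (simplified, pseudo version):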

#!/bin/bash
#PBS -q workq
#PBS -l nodes=1:ppn=4
#PBS -l pmem=1gb,pvmem=2gb
#PBS -l walltime=30:00:00
#PBS -o out.log
#PBS -N jobname
#PBS -j oe

cd "$PBS_O_WORKDIR"

NNODES=$(uniq "$PBS_NODEFILE" | wc -l)
NP=$(wc -l < "$PBS_NODEFILE")
echo PBS_NODEFILE is $PBS_NODEFILE
echo NNODES is $NNODES
cat $PBS_NODEFILE

cr_run 'executable' infile.inp > outfile.out &

## store process ID as variable and sleep 29 hours, then checkpoint
BGPID=$!
sleep 104400

cr_checkpoint -p $BGPID -f checkFile.checkpoint --term

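I have successfully checkpointed jobs that use dynamically linked binaries (mainly executables built from my own code), so I already know how to do that. The problem is that the executable I am trying to run is precompiled and I don't have the source code, or this wouldn't be a problem.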

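I found the documentation here (see section 4.2), which seems to offer some suggestions. Before trying to decipher and test them, I thought it worth asking whether anyone has experience checkpointing jobs that run executables which were not dynamically linked at compile time.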

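As a side note, the code has no internal checkpointing. Also, we use a more graceful way of triggering the checkpoint than sleeping for 29 hours; I only included the sleep to keep the script uncluttered and readable.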
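The fixed 29-hour sleep in the script is simply the 30-hour walltime minus a one-hour safety margin. A minimal sketch of deriving that value instead of hard-coding it is shown below. The function names walltime_to_seconds and checkpoint_delay are hypothetical, and the one-hour default margin is an assumption chosen to match the script above; this is not the poster's actual method.

```shell
#!/bin/bash
# Hypothetical helpers (not part of Torque or BLCR).

# Convert an HH:MM:SS walltime string to seconds.
# awk treats the fields as decimal numbers, so "08" is not parsed as octal.
walltime_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Seconds to wait before checkpointing: walltime minus a safety margin.
# $1 = walltime (HH:MM:SS), $2 = margin in seconds (assumed default: 3600).
checkpoint_delay() {
    echo $(( $(walltime_to_seconds "$1") - ${2:-3600} ))
}
```

In the submission script, `sleep "$(checkpoint_delay 30:00:00)"` would then replace `sleep 104400`.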

1 answer:

Answer 0: (score: 1)

The answer is covered in the BLCR FAQ: https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#staticlink

If you can checkpoint and restart a dynamically linked application correctly, but 
cannot do so with the same application linked statically, this FAQ entry is for you.
There are multiple reasons why BLCR may have problems with statically linked executables.

** The cr_run utility only supports dynamic executables **
If you wish to checkpoint an unmodified executable, the typical recipe is

$ cr_run my_app my_args

However, the cr_run utility does its work using the "LD_PRELOAD" environment variable 
to force loading of BLCR's support code into the address space of the application. That 
mechanism is only functional for dynamically linked executables. There is no magic we 
can perform today that will resolve this (though in the future we'd like to replace 
our use of LD_PRELOAD with a kernel-side mechanism). So, you'll need to relink any 
statically linked executables to include BLCR support.
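Because the LD_PRELOAD mechanism only applies to dynamically linked binaries, it can help to confirm how an executable was linked before handing it to cr_run. The sketch below assumes the standard `file` utility is installed; the function names classify_linkage and can_use_cr_run are hypothetical helpers, not BLCR commands.

```shell
#!/bin/bash
# Hypothetical helpers (not part of BLCR).

# Map one line of `file` output to "static", "dynamic", or "unknown".
classify_linkage() {
    case "$1" in
        *"statically linked"*|*"static-pie linked"*) echo static ;;
        *"dynamically linked"*)                      echo dynamic ;;
        *)                                           echo unknown ;;
    esac
}

# Succeed only if the binary at path $1 is reported as dynamically linked,
# that is, only if an LD_PRELOAD-based wrapper can affect it.
can_use_cr_run() {
    [ "$(classify_linkage "$(file -L "$1")")" = dynamic ]
}
```

For example, `can_use_cr_run ./my_app || echo "my_app is not dynamically linked; cr_run will not work"` could be placed ahead of the cr_run line in a job script.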

** Linking BLCR's libraries statically takes special care **
OK, we've told you why cr_run doesn't work and told you to relink. You tried linking 
with -lcr_run and/or -lcr and still can't get a checkpoint to work. What went wrong?
You need a -u option in addition to the -l, or the static linking will simply ignore 
BLCR's library.
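As a rough illustration of pairing -u with -l, the sketch below prints a flag pair for each library. The symbol names cr_run_link_me and cr_link_me are assumptions based on the "Cautionary linker notes" section of the BLCR Users Guide and should be checked against the installed BLCR version; the function name blcr_link_flags is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the linker flags for a static link against BLCR.
# The -u option marks a symbol as undefined so the static linker is forced to
# pull the library in instead of discarding it as unreferenced.
# ASSUMPTION: the symbol names below follow the BLCR Users Guide.
blcr_link_flags() {
    case "$1" in
        cr_run) echo "-u cr_run_link_me -lcr_run" ;;
        cr)     echo "-u cr_link_me -lcr" ;;
        *)      echo "usage: blcr_link_flags cr_run|cr" >&2; return 1 ;;
    esac
}

# Example relink (requires the application's object files and a BLCR install):
#   gcc -static -o my_app my_app.o $(blcr_link_flags cr_run)
```

Relinking in this way needs the object files or source of the application, so it does not apply to a binary that is only available precompiled.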

** BLCR doesn't support LinuxThreads **
Ok, what else could go wrong? You've followed the guidance given in the "Cautionary
linker notes" section of the BLCR Users Guide when you linked your application. You 
even ran

$ nm my_app | grep link_me

to be sure the symbol you specified with -u is linked in. However, you are seeing 
weird crashes of your application when you try to checkpoint.

The culprit might be LinuxThreads. Why? Because at the time this FAQ entry is being 
written, there are many Linux distributions that install the static libs for 
LinuxThreads in the default library search path, and with the NPTL static libs 
elsewhere. The resolution could be as simple as linking your application with 
-L/usr/lib/nptl or -L/usr/lib64/nptl, perhaps by setting an "LDFLAGS" variable (though 
it is possible that your distribution has picked some other location).

While it is not strictly required due to binary compatibility between LinuxThreads and 
NPTL, we'd recommend that you at least consider a recompile with -I/usr/include/nptl 
in CFLAGS.

Note, of course, that if BLCR's utilities are statically linked to LinuxThreads, then 
they need to be rebuilt too. See the BLCR Admin Guide for more information on that.