Question

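I have a head node and 3 compute nodes. On the head node, Julia is installed under /apps and /state/p1/apps.

Julia is not available to me as a Slurm module.

How should I set up my Julia installation so that I can run Julia scripts through Slurm using ClusterManagers?

At the moment I get this error: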

srun: error: node-0-2: tasks 0-2: Exited with exit code 2

The Julia script:

using ClusterManagers

addprocs(SlurmManager(3), partition="slurm", t="00:5:00")

hosts = []
pids = []
for i in workers()
        host, pid = fetch(@spawnat i (gethostname(), getpid()))
        println(host)
        push!(hosts, host)
        push!(pids, pid)
end


# The Slurm resource allocation is released when all the workers have
# exited
for i in workers()
        rmprocs(i)
end

Update

I seem to have a Slurm problem. I tried updating ClusterManagers as suggested by @user338207, and using SlurmManager(3) instead of SlurmManager(2) as suggested by crstnbr.

srun -N 2 julia parallel2.jl
srun: error: node-0-2: task 2: Exited with exit code 1
srun: error: node-0-2: task 2: Exited with exit code 1
WARNING: dropping worker: file not created in 63 seconds
WARNING: dropping worker: file not created in 63 seconds
node-0-1 3 out of 3
node-0-1
WARNING: dropping worker: file not created in 63 seconds
ERROR: LoadError: connect: connection refused (ECONNREFUSED)
try_yieldto(::Base.##296#297{Task}, ::Task) at ./event.jl:189
wait() at ./event.jl:234
wait(::Condition) at ./event.jl:27
stream_wait(::TCPSocket, ::Condition, ::Vararg{Condition,N} where N) at ./stream.jl:42
wait_connected(::TCPSocket) at ./stream.jl:258

However, srun -N 2 hostname works fine.

Answer 1

Here is how to set up Julia on a Linux cluster and run parallel jobs through Slurm.

Download the Julia binaries for Linux from https://julialang.org/downloads/
and put them somewhere, for example in ~/bin/julia-v0.6 (you will have to create this folder).

In the same folder, create a file named julia-environment with the following contents:

export PATH=$HOME/bin/julia-v0.6/bin:$PATH
export LD_LIBRARY_PATH=$HOME/bin/julia-v0.6/lib:$LD_LIBRARY_PATH
export CPATH=$HOME/bin/julia-v0.6/include:$CPATH
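As a rough sketch, the environment file above could be generated and checked with two small shell functions. The names write_julia_env and julia_first_on_path are made up for this illustration, and the install directory is passed in as an argument rather than assumed. The check only confirms that sourcing the file puts the Julia bin directory first on PATH; it does not verify that a working julia binary is there.

```shell
#!/bin/bash
# Hypothetical helpers (not part of Julia or Slurm) for the julia-environment file.

# Write the julia-environment file into the given Julia install directory.
write_julia_env() {
    local julia_dir="$1"
    mkdir -p "$julia_dir"
    # $julia_dir is expanded now; \$PATH etc. stay literal until the file is sourced.
    cat > "$julia_dir/julia-environment" <<EOF
export PATH="$julia_dir/bin:\$PATH"
export LD_LIBRARY_PATH="$julia_dir/lib:\$LD_LIBRARY_PATH"
export CPATH="$julia_dir/include:\$CPATH"
EOF
}

# Succeed if sourcing the file puts <julia_dir>/bin at the front of PATH.
julia_first_on_path() {
    local julia_dir="$1"
    # Run in a subshell so the caller's environment is left untouched.
    (
        source "$julia_dir/julia-environment"
        case "$PATH" in
            "$julia_dir/bin:"*) exit 0 ;;
            *) exit 1 ;;
        esac
    )
}
```

For example, write_julia_env "$HOME/bin/julia-v0.6" followed by julia_first_on_path "$HOME/bin/julia-v0.6" && echo ok would confirm the file behaves as intended before it is used in a job script.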

Now you can submit a job file such as the following with sbatch myjobfile.sh:

#!/bin/bash -l
#SBATCH --nodes=2
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=2
#SBATCH --time=00:10:00
#SBATCH --output=myoutput.log
#SBATCH --job-name=my-julia-job

source $HOME/bin/julia-v0.6/julia-environment
cd working/folder/of/your/choice
julia my_clustermanager_script.jl

(Note that you can also prepend srun --ntasks=1 to the julia command; see this page on julialang.org.)
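The job file above requests 4 tasks, and the number of workers the Julia script asks for presumably has to fit within that. The sketch below reads the --ntasks value from a job file so the two numbers can be compared before submitting. Both function names are hypothetical, and the rule "workers must not exceed --ntasks" is an assumption about how the allocation is used, not something taken from the Slurm documentation.

```shell
#!/bin/bash
# Hypothetical helpers (not part of Slurm) for checking a job file.

# Print the N of the first "#SBATCH --ntasks=N" line in a job file.
sbatch_ntasks() {
    local jobfile="$1"
    # "--ntasks-per-node=N" is skipped because "=" must directly follow "--ntasks".
    sed -n 's/^#SBATCH[[:space:]][[:space:]]*--ntasks=\([0-9][0-9]*\).*/\1/p' "$jobfile" | head -n 1
}

# Succeed when the requested worker count fits within the job's task count.
# Assumption: the worker count should not exceed --ntasks.
workers_fit() {
    local workers="$1" jobfile="$2" ntasks
    ntasks="$(sbatch_ntasks "$jobfile")"
    [ -n "$ntasks" ] && [ "$workers" -le "$ntasks" ]
}
```

A possible use is workers_fit 4 myjobfile.sh && sbatch myjobfile.sh, which submits only when the counts are consistent.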

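Of course, you can also start an interactive job by allocating resources with salloc.

Update

Running the job script above with sbatch myjobfile.sh, using the following my_clustermanager_script.jl (note SlurmManager(4) rather than SlurmManager(3)):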

using ClusterManagers

addprocs(SlurmManager(4), t="00:5:00")

hosts = []
pids = []
for i in workers()
        host, pid = fetch(@spawnat i (gethostname(), getpid()))
        println(host)
        push!(hosts, host)
        push!(pids, pid)
end


# The Slurm resource allocation is released when all the workers have
# exited
for i in workers()
        rmprocs(i)
end

I get the following output files:

myoutput.log:

connecting to worker 1 out of 4
connecting to worker 2 out of 4
connecting to worker 3 out of 4
connecting to worker 4 out of 4
cheops30410
cheops30410
cheops30414
cheops30414

job0000.out: julia_worker:9009#173.12.2.191

job0001.out: julia_worker:9010#173.12.2.191

job0002.out: julia_worker:9010#173.12.2.192

job0003.out: julia_worker:9009#173.12.2.192
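Each job*.out file holds one line of the form julia_worker:PORT#IP, which is what a Julia worker prints when it starts. When workers are dropped, it can help to see which files actually contain such a line. The sketch below lists them; the function names are invented for this example, and it assumes the files sit in a single directory and that the worker line is the first line of each file.

```shell
#!/bin/bash
# Hypothetical helpers for inspecting the job*.out files written by the workers.

# Turn "julia_worker:9009#173.12.2.191" into "173.12.2.191 9009".
# Returns non-zero if the line does not have that shape.
parse_worker_line() {
    local line="$1" rest port host
    case "$line" in
        julia_worker:*"#"*) ;;
        *) return 1 ;;
    esac
    rest="${line#julia_worker:}"   # e.g. 9009#173.12.2.191
    port="${rest%%#*}"             # text before the first '#'
    host="${rest#*#}"              # text after the first '#'
    printf '%s %s\n' "$host" "$port"
}

# Print "<file> <ip> <port>" for every job*.out in a directory whose
# first line is a valid worker line; other files are skipped silently.
list_workers() {
    local dir="$1" f line parsed
    for f in "$dir"/job*.out; do
        [ -e "$f" ] || continue
        line=""
        IFS= read -r line < "$f" || true
        if parsed="$(parse_worker_line "$line")"; then
            printf '%s %s\n' "$(basename "$f")" "$parsed"
        fi
    done
}
```

Running list_workers . in the working folder prints one line per worker that reported in; a missing or empty file points at a worker that never started.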

Answer 2

I use a script similar to crstnbr's, and I also ran into the problem srun: unrecognized option '--enable-threaded-blas=false'. I had to change src/slurm.jl as described here:

https://github.com/JuliaParallel/ClusterManagers.jl/issues/75#issuecomment-319919108

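This change was included in version 0.2.0 of ClusterManagers.jl; perhaps you are still using version 0.1.2. If so, upgrading may solve the problem.

Julia does not let you upgrade a package that has local modifications. Such packages are shown with a + sign after the version number.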
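To decide whether an upgrade is needed, the installed version string can be compared against 0.2.0. The following is only a sketch: version_lt is a made-up helper, it relies on GNU sort -V, and it strips a trailing + so that a locally modified 0.1.2+ is compared as 0.1.2. How you obtain the installed version string is left to you.

```shell
#!/bin/bash
# Hypothetical helper: succeed if version $1 is strictly older than version $2.
# Requires GNU sort (for the -V version-sort option).
version_lt() {
    local a="${1%+}" b="${2%+}"   # drop a trailing "+" (locally modified package)
    [ "$a" != "$b" ] &&
        [ "$(printf '%s\n%s\n' "$a" "$b" | sort -V | head -n 1)" = "$a" ]
}
```

For instance, version_lt "0.1.2+" "0.2.0" && echo "upgrade needed" reports that the installed copy predates the release containing the fix.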

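If you do not want to keep your local modifications (in particular if the new version already contains the changes you made to your local copy), these are the steps to upgrade a dirty package: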

cd ~/.julia/v0.6/ClusterManagers/
git diff # show your modification
cp -R ~/.julia/v0.6/ClusterManagers/ ~/ClusterManagers.bak # backup copy
git checkout . # discard your modification
julia --eval 'Pkg.update("ClusterManagers")' # upgrade the package

Julia and Slurm setup
