Julia and Slurm setup

Time: 2018-02-05 21:03:01

Tags: parallel-processing julia slurm

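I have a master node and 3 compute nodes. On the master node, Julia is installed in /apps and /state/p1/apps.

I do not have Julia available as a Slurm module.

How should I set up the Julia installation so that I can run Julia scripts through Slurm using ClusterManagers?

Currently I get the error: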

srun: error: node-0-2: tasks 0-2: Exited with exit code 2

The Julia script:

using ClusterManagers

addprocs(SlurmManager(3), partition="slurm", t="00:5:00")

hosts = []
pids = []
for i in workers()
        host, pid = fetch(@spawnat i (gethostname(), getpid()))
        println(host)
        push!(hosts, host)
        push!(pids, pid)
end


# The Slurm resource allocation is released when all the workers have
# exited
for i in workers()
        rmprocs(i)
end

Update

I seem to have a Slurm problem. Following crstnbr's suggestion, I tried again with ClusterManagers updated as suggested by @user338207, and with SlurmManager(3) instead of SlurmManager(2).

srun -N 2 julia parallel2.jl
srun: error: node-0-2: task 2: Exited with exit code 1
srun: error: node-0-2: task 2: Exited with exit code 1
WARNING: dropping worker: file not created in 63 seconds
WARNING: dropping worker: file not created in 63 seconds
node-0-1 3 out of 3
node-0-1
WARNING: dropping worker: file not created in 63 seconds
ERROR: LoadError: connect: connection refused (ECONNREFUSED)
try_yieldto(::Base.##296#297{Task}, ::Task) at ./event.jl:189
wait() at ./event.jl:234
wait(::Condition) at ./event.jl:27
stream_wait(::TCPSocket, ::Condition, ::Vararg{Condition,N} where N) at ./stream.jl:42
wait_connected(::TCPSocket) at ./stream.jl:258

However, srun -N 2 hostname works fine.
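
Since hostname runs on the nodes but the Julia workers do not start, one thing worth ruling out is whether julia resolves on the compute nodes at all. The post does not confirm that this is the cause, so the following is only a minimal sketch. The helper names have_cmd and check_julia_on_nodes are hypothetical, and it assumes that a login shell on each node applies the same PATH setup that worker processes would see.

```shell
# Returns 0 if the given command name resolves on PATH in this shell.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Hypothetical helper: asks each allocated node where julia resolves.
# The node count defaults to 2, matching the srun -N 2 call in the question.
check_julia_on_nodes() {
    local nodes="${1:-2}"
    if ! have_cmd srun; then
        echo "srun not found; run this on a Slurm login node" >&2
        return 2
    fi
    # Each task prints its hostname and the julia path, or a marker if missing.
    srun -N "$nodes" bash -lc \
        'echo "$(hostname): $(command -v julia || echo julia-not-on-PATH)"'
}
```

Calling check_julia_on_nodes 2 on the master node would then print one line per node. A node that reports julia-not-on-PATH cannot start a worker no matter how ClusterManagers is configured.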

2 answers:

Answer 0 (score: 0)

This is how to set up Julia on a Linux cluster and run parallel jobs through Slurm.

  1. Go to the Julia downloads page (https://julialang.org/downloads/).
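  2. Download the generic Linux binaries.
  3. Put them somewhere, for example in ~/bin/julia-v0.6 (you have to create this folder).
  4. In the same folder, create a file named julia-environment with the following contents: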

    export PATH=$HOME/bin/julia-v0.6/bin:$PATH
    export LD_LIBRARY_PATH=$HOME/bin/julia-v0.6/lib:$LD_LIBRARY_PATH
    export CPATH=$HOME/bin/julia-v0.6/include:$CPATH
    
  5. Now you can submit a job file such as the following with sbatch myjobfile.sh:

    #!/bin/bash -l
    #SBATCH --nodes=2
    #SBATCH --ntasks=4
    #SBATCH --ntasks-per-node=2
    #SBATCH --time=00:10:00
    #SBATCH --output=myoutput.log
    #SBATCH --job-name=my-julia-job
    
    source $HOME/bin/julia-v0.6/julia-environment
    cd working/folder/of/your/choice
    julia my_clustermanager_script.jl
    
  6. (Note that you can also prefix the julia command with srun --ntasks=1; see julialang.org.)

    Of course, you can also start an interactive job by allocating resources with salloc.
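
    Before submitting a job, the install location from steps 3 and 4 can be sanity-checked. The sketch below is not part of the original recipe. julia_install_ok is a hypothetical helper, and JULIA_PREFIX is an illustrative variable that defaults to the ~/bin/julia-v0.6 folder used above.

```shell
# Hypothetical helper: succeeds if <prefix>/bin/julia exists and is executable,
# which is what the PATH export in julia-environment relies on.
julia_install_ok() {
    local prefix="$1"
    [ -x "$prefix/bin/julia" ]
}

# Illustrative default; override JULIA_PREFIX if Julia was unpacked elsewhere.
JULIA_PREFIX="${JULIA_PREFIX:-${HOME:-.}/bin/julia-v0.6}"

if julia_install_ok "$JULIA_PREFIX"; then
    echo "found $JULIA_PREFIX/bin/julia"
else
    echo "no executable julia under $JULIA_PREFIX/bin" >&2
fi
```

    If the check fails, the generic binaries were probably unpacked one directory level too deep or too shallow, and the exports in julia-environment would point at a folder without a julia executable.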

    Update

    Running the job script above with sbatch myjobfile.sh, using the following my_clustermanager_script.jl (note SlurmManager(4) instead of SlurmManager(3)):

    using ClusterManagers
    
    addprocs(SlurmManager(4), t="00:5:00")
    
    hosts = []
    pids = []
    for i in workers()
            host, pid = fetch(@spawnat i (gethostname(), getpid()))
            println(host)
            push!(hosts, host)
            push!(pids, pid)
    end
    
    
    # The Slurm resource allocation is released when all the workers have
    # exited
    for i in workers()
            rmprocs(i)
    end
    

    I get the following output files:

    myoutput.log

    connecting to worker 1 out of 4
    connecting to worker 2 out of 4
    connecting to worker 3 out of 4
    connecting to worker 4 out of 4
    cheops30410
    cheops30410
    cheops30414
    cheops30414
    

    job0000.out

    julia_worker:9009#173.12.2.191

    job0001.out

    julia_worker:9010#173.12.2.191

    job0002.out

    julia_worker:9010#173.12.2.192

    job0003.out

    julia_worker:9009#173.12.2.192

Answer 1 (score: 0)

I use a script similar to crstnbr's and, in fact, I also ran into a problem: srun: unrecognized option '--enable-threaded-blas=false'. I had to change src/slurm.jl as described here:

https://github.com/JuliaParallel/ClusterManagers.jl/issues/75#issuecomment-319919108

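This change was included in version 0.2.0 of ClusterManagers.jl; perhaps you are still using version 0.1.2. If so, upgrading may solve the problem.

Julia does not let you upgrade a package that has local modifications. Such packages are shown with a + sign after the version number.

If you do not want to keep your local modifications, here are the steps to upgrade a dirty package (especially if the new version already includes the changes you made to your local copy):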

cd ~/.julia/v0.6/ClusterManagers/
git diff # show your modification
cp -R ~/.julia/v0.6/ClusterManagers/ ~/ClusterManagers.bak # backup copy
git checkout . # discard your modification
julia --eval 'Pkg.update("ClusterManagers")' # upgrade the package
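
Before running the steps above, it can help to confirm that the checkout really has modifications that git checkout . would discard. The following is a minimal sketch. has_local_changes is a hypothetical helper, and it only looks at unstaged changes to tracked files, which is what git checkout . reverts.

```shell
# Hypothetical helper: succeeds if the git checkout at $1 has unstaged
# changes to tracked files (the changes that `git checkout .` would discard).
has_local_changes() {
    local dir="$1"
    # Not a git checkout (or missing directory): nothing to discard.
    git -C "$dir" rev-parse --git-dir >/dev/null 2>&1 || return 1
    # `git diff --quiet` exits 0 when clean and non-zero when there are diffs.
    ! git -C "$dir" diff --quiet
}
```

For example, has_local_changes ~/.julia/v0.6/ClusterManagers && echo "dirty" reports whether the backup and checkout steps are needed at all.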