I submit some Spark jobs in cluster mode, and they run fine when I launch them one at a time with the following sbatch specification.
#!/bin/bash -l
#SBATCH -J Spark
#SBATCH --time=0-05:00:00 # 5 hours
#SBATCH --partition=batch
#SBATCH --qos qos-batch
###SBATCH -N $NODES
###SBATCH --ntasks-per-node=$NTASKS
### -c, --cpus-per-task=<ncpus>
### (multithreading) Request that ncpus be allocated per process
#SBATCH -c 7
#SBATCH --exclusive
#SBATCH --mem=0
#SBATCH --dependency=singleton
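For the single runs I submit each benchmark script on its own and let this header decide the allocation, without passing any node or task counts on the command line, roughly like this (the script names are the same ones used in the launcher below):

sbatch spark-teragen.sh
sbatch spark-terasort.sh
sbatch spark-teravalidate.sh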
If I submit the same job with different node and task counts through a launcher, Slurm gets confused and tries to allocate according to $SLURM_NTASKS, which ends up as 16, even though I only request, for example, 1 node with 3 tasks per node.
#!/bin/bash -l
for n in {1..4}
do
    for t in {3..4}
    do
        echo "Running benchmark with ${n} nodes and ${t} tasks per node"
        sbatch -N ${n} --ntasks-per-node=${t} spark-teragen.sh
        sleep 5
        sbatch -N ${n} --ntasks-per-node=${t} spark-terasort.sh
        sleep 5
        sbatch -N ${n} --ntasks-per-node=${t} spark-teravalidate.sh
        sleep 5
    done
done
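To see what Slurm actually derives for each of these submissions, I can print the relevant environment variables at the top of the job script before Spark is started. This is only a diagnostic sketch; the real contents of spark-teragen.sh are not shown here:

# Diagnostic lines (sketch) for the top of spark-teragen.sh:
# print what Slurm derived for this allocation.
echo "SLURM_JOB_NUM_NODES   = ${SLURM_JOB_NUM_NODES}"
echo "SLURM_NTASKS          = ${SLURM_NTASKS}"
echo "SLURM_NTASKS_PER_NODE = ${SLURM_NTASKS_PER_NODE}"
echo "SLURM_CPUS_PER_TASK   = ${SLURM_CPUS_PER_TASK}"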
How can I fix the error below, i.e. prevent Slurm from allocating the wrong number of tasks per node and exceeding the limit?
Error:
srun: Warning: can't honor --ntasks-per-node set to 3 which doesn't match the
requested tasks 16 with the number of requested nodes 1. Ignoring --ntasks-per-node.
srun: error: Unable to create step for job 233838: More processors requested than
permitted
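For completeness, one launcher variant I could imagine is to make the intended total task count explicit on the sbatch command line, so that $SLURM_NTASKS cannot end up at 16 by accident. This is only a sketch and I have not verified that it avoids the error:

sbatch -N ${n} --ntasks=$((n * t)) --ntasks-per-node=${t} spark-teragen.sh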