Why doesn't the memory given to the Spark master match the memory requested in the Slurm script?

Asked: 2019-05-08 01:39:14

Tags: multithreading apache-spark mpi slurm

I am running Spark 2.3.0 with the following submission script:

#!/bin/bash
#SBATCH --account=def-hmcheick
#SBATCH --nodes=2
#SBATCH --time=00:10:00
#SBATCH --mem=100G
#SBATCH --cpus-per-task=5
#SBATCH --ntasks-per-node=6
#SBATCH --output=/project/6008168/moudi/job/spark-job/sparkjob-%j.out
#SBATCH --mail-type=ALL
#SBATCH --error=/project/6008168/moudi/job/spark-job/error6_hours.out



## --------------------------------------
## 0. Preparation
## --------------------------------------

# load the Spark module
module load spark/2.3.0
module load python/3.7.0
source "/home/moudi/ENV3.7.0/bin/activate"

set -x
# identify the Spark cluster with the Slurm jobid
export SPARK_IDENT_STRING=$SLURM_JOBID

# prepare directories
export SPARK_WORKER_DIR=$HOME/.spark/2.3.0/$SPARK_IDENT_STRING/worker
export SPARK_LOG_DIR=$HOME/.spark/2.3.0/$SPARK_IDENT_STRING/logs
export SPARK_LOCAL_DIRS=$HOME/.spark/2.3.0/$SPARK_IDENT_STRING/tmp/spark
mkdir -p $SPARK_LOG_DIR $SPARK_WORKER_DIR $SPARK_LOCAL_DIRS

# These are the defaults anyways, but configurable 
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=8080

export JOB_HOME="$HOME/.spark/2.3.0/$SPARK_IDENT_STRING"
echo "line 39----JOB_HOME=$JOB_HOME"
echo "line 40----SPARK_HOME=$SPARK_HOME"
mkdir -p $JOB_HOME

# Try to load stuff that the spark scripts will load
source "$SPARK_HOME/sbin/spark-config.sh"
source "$SPARK_HOME/bin/load-spark-env.sh"

## --------------------------------------
## 1. Start the Spark cluster master
## --------------------------------------

$SPARK_HOME/sbin/start-master.sh
sleep 5
MASTER_URL=$(grep -Po '(?=spark://).*' $SPARK_LOG_DIR/spark-${SPARK_IDENT_STRING}-org.apache.spark.deploy.*master*.out)
echo "line 54----MASTER_URL = ${MASTER_URL}"


## --------------------------------------
## 2. Start the Spark cluster workers
## --------------------------------------

# get the resource details from the Slurm job
export SPARK_WORKER_CORES=${SLURM_CPUS_PER_TASK:-1}
export SPARK_MEM=$(( ${SLURM_MEM_PER_CPU:-3072} * ${SLURM_CPUS_PER_TASK:-1} ))
#export SLURM_SPARK_MEM=$(printf "%.0f" $((${SLURM_MEM_PER_NODE} *93/100)))
export SPARK_DAEMON_MEMORY=${SPARK_MEM}m
export SPARK_WORKER_MEMORY=${SPARK_MEM}
NWORKERS=${SLURM_NTASKS:-1} #just for testing you should delete this line
NEXECUTORS=$((SLURM_NTASKS - 1))

# start the workers on each node allocated to the job
export SPARK_NO_DAEMONIZE=1

srun -n ${NWORKERS} -N $SLURM_JOB_NUM_NODES --label --output=$SPARK_LOG_DIR/spark-%j-workers.out start-slave.sh -m ${SPARK_MEM}M -c ${SLURM_CPUS_PER_TASK} ${MASTER_URL}  &

## --------------------------------------
## 3. Submit a task to the Spark cluster
## --------------------------------------
spark-submit --master ${MASTER_URL} --total-executor-cores $((SLURM_NTASKS * SLURM_CPUS_PER_TASK)) --executor-memory ${SPARK_WORKER_MEMORY}m  --num-executors $((SLURM_NTASKS - 1)) --driver-memory ${SPARK_WORKER_MEMORY}m /project/6008168/moudi/mainold.py

flag_path=$JOB_HOME/master_host
export SPARK_MASTER_IP=$( hostname )
echo "line 81----SPARK_MASTER_IP=$SPARK_MASTER_IP"
MASTER_NODE=$( scontrol show hostname $SLURM_NODELIST | head -n 1 )
MASTER_NODE=$MASTER_NODE.int.cedar.computecanada.ca
MASTER_URL="spark://$MASTER_NODE:$SPARK_MASTER_PORT"

## --------------------------------------
## 4. Clean up
## --------------------------------------


# stop the workers
scancel ${SLURM_JOBID}.0

# stop the master
$SPARK_HOME/sbin/stop-master.sh
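
I submit the whole thing as an ordinary batch job (sparkjob.sh is just the name I use here for the file above):

sbatch sparkjob.sh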

The script does not run correctly. I am hitting the following problem:

Error occurred during initialization of VM
Too small initial heap
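
To check which heap flags each JVM was actually launched with, I grep the "Spark Command:" lines out of the logs under $SPARK_LOG_DIR (just a rough sanity check, reusing the paths defined in the script above; the exact file names may differ):

grep -h 'Spark Command:' $SPARK_LOG_DIR/*.out | grep -o -e '-Xm[sx][^ ]*' | sort | uniq -c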

In fact, at the beginning of the master's output file I get:

Spark Command: /cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/java/1.8.0_121/bin/java -cp /home/moudi/.spark/2.3.0/conf/:/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/spark/2.3.0/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host cdr1272.int.cedar.computecanada.ca --port 7077 --webui-port 8080

Spark does not work because of this -Xmx1g. Could you help me figure out why it is 1g? I have specified 15g of memory for the master.
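
For reference, this is the arithmetic I expect to give the master its 15g; the 3072 MB fallback for SLURM_MEM_PER_CPU is an assumption on my part (with --mem requested per node, Slurm does not export SLURM_MEM_PER_CPU at all):

# same expression as in the script: 3072 MB/CPU (fallback) * 5 CPUs per task = 15360 MB ≈ 15g
SPARK_MEM=$(( ${SLURM_MEM_PER_CPU:-3072} * ${SLURM_CPUS_PER_TASK:-1} ))
# according to spark-env.sh.template, SPARK_DAEMON_MEMORY is the heap given to the
# master and worker daemons themselves, and it defaults to 1g when it is not set
export SPARK_DAEMON_MEMORY=${SPARK_MEM}m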

In the same file (the master output), I can see the 12 workers registering, each with 5 cores and 15.0 GB of RAM:

19/05/07 15:11:25 INFO Master: Registering worker 172.16.140.247:46822 with 5 cores, 15.0 GB RAM
19/05/07 15:11:25 INFO Master: Registering worker 172.16.141.2:41554 with 5 cores, 15.0 GB RAM
19/05/07 15:11:25 INFO Master: Registering worker 172.16.141.2:38652 with 5 cores, 15.0 GB RAM
19/05/07 15:11:25 INFO Master: Registering worker 172.16.140.247:35553 with 5 cores, 15.0 GB RAM
19/05/07 15:11:25 INFO Master: Registering worker 172.16.140.247:43477 with 5 cores, 15.0 GB RAM
19/05/07 15:11:25 INFO Master: Registering worker 172.16.140.247:36128 with 5 cores, 15.0 GB RAM
19/05/07 15:11:25 INFO Master: Registering worker 172.16.140.247:35494 with 5 cores, 15.0 GB RAM
19/05/07 15:11:27 INFO Master: Registering worker 172.16.141.2:34899 with 5 cores, 15.0 GB RAM
19/05/07 15:11:27 INFO Master: Registering worker 172.16.140.247:40010 with 5 cores, 15.0 GB RAM
19/05/07 15:11:29 INFO Master: Registering worker 172.16.141.2:37054 with 5 cores, 15.0 GB RAM
19/05/07 15:11:31 INFO Master: Registering worker 172.16.141.2:37322 with 5 cores, 15.0 GB RAM
19/05/07 15:11:33 INFO Master: Registering worker 172.16.141.2:36519 with 5 cores, 15.0 GB RAM

...

19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/0 on worker worker-20190507151124-172.16.140.247-36128
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/1 on worker worker-20190507151124-172.16.141.2-38652
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/2 on worker worker-20190507151124-172.16.141.2-41554
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/3 on worker worker-20190507151124-172.16.140.247-43477
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/4 on worker worker-20190507151126-172.16.141.2-34899
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/5 on worker worker-20190507151128-172.16.141.2-37054
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/6 on worker worker-20190507151124-172.16.140.247-35553
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/7 on worker worker-20190507151130-172.16.141.2-37322
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/8 on worker worker-20190507151124-172.16.140.247-46822
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/9 on worker worker-20190507151124-172.16.140.247-35494
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/10 on worker worker-20190507151132-172.16.141.2-36519
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/11 on worker worker-20190507151127-172.16.140.247-40010

In addition, in the worker directory I see a subfolder named app-20190507151158-0000, which in turn contains 12 subfolders, 0..11. Each of these contains an stderr file that looks like a log file. In each of those files I also noticed ...

19/05/05 23:49:38 INFO MemoryStore: MemoryStore started with capacity 7.8 GB

I don't know exactly what this means. Do the executors only get 7.8 GB, or the full 15 GB?
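
If I read the Spark 2.3 documentation correctly, the capacity reported by MemoryStore is not the whole executor heap but the unified (storage + execution) region, roughly (JVM max heap − 300 MB reserved) × spark.memory.fraction, with spark.memory.fraction defaulting to 0.6. The figure below is only my back-of-the-envelope upper bound for a ~15360 MB executor:

# (15360 MB heap - 300 MB reserved) * 0.6; the JVM reports a max heap somewhat
# below -Xmx, which is presumably why the real number comes out lower than this
echo $(( (15360 - 300) * 6 / 10 ))   # 9036 MB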
