I am running Spark 2.3.0 with the following script:
#!/bin/bash
#SBATCH --account=def-hmcheick
#SBATCH --nodes=2
#SBATCH --time=00:10:00
#SBATCH --mem=100G
#SBATCH --cpus-per-task=5
#SBATCH --ntasks-per-node=6
#SBATCH --output=/project/6008168/moudi/job/spark-job/sparkjob-%j.out
#SBATCH --mail-type=ALL
#SBATCH --error=/project/6008168/moudi/job/spark-job/error6_hours.out
## --------------------------------------
## 0. Preparation
## --------------------------------------
# load the Spark module
module load spark/2.3.0
module load python/3.7.0
source "/home/moudi/ENV3.7.0/bin/activate"
set -x
# identify the Spark cluster with the Slurm jobid
export SPARK_IDENT_STRING=$SLURM_JOBID
# prepare directories
export SPARK_WORKER_DIR=$HOME/.spark/2.3.0/$SPARK_IDENT_STRING/worker
export SPARK_LOG_DIR=$HOME/.spark/2.3.0/$SPARK_IDENT_STRING/logs
export SPARK_LOCAL_DIRS=$HOME/.spark/2.3.0/$SPARK_IDENT_STRING/tmp/spark
mkdir -p $SPARK_LOG_DIR $SPARK_WORKER_DIR $SPARK_LOCAL_DIRS
# These are the defaults anyways, but configurable
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=8080
export JOB_HOME="$HOME/.spark/2.3.0/$SPARK_IDENT_STRING"
echo "line 39----JOB_HOME=$JOB_HOME"
echo "line 40----SPARK_HOME=$SPARK_HOME"
mkdir -p $JOB_HOME
# Try to load stuff that the spark scripts will load
source "$SPARK_HOME/sbin/spark-config.sh"
source "$SPARK_HOME/bin/load-spark-env.sh"
## --------------------------------------
## 1. Start the Spark cluster master
## --------------------------------------
$SPARK_HOME/sbin/start-master.sh
sleep 5
MASTER_URL=$(grep -Po '(?=spark://).*' $SPARK_LOG_DIR/spark-${SPARK_IDENT_STRING}-org.apache.spark.deploy.*master*.out)
echo "line 54----MASTER_URL = ${MASTER_URL}"
## --------------------------------------
## 2. Start the Spark cluster workers
## --------------------------------------
# get the resource details from the Slurm job
export SPARK_WORKER_CORES=${SLURM_CPUS_PER_TASK:-1}
export SPARK_MEM=$(( ${SLURM_MEM_PER_CPU:-3072} * ${SLURM_CPUS_PER_TASK:-1} ))
#export SLURM_SPARK_MEM=$(printf "%.0f" $((${SLURM_MEM_PER_NODE} *93/100)))
export SPARK_DAEMON_MEMORY=${SPARK_MEM}m
export SPARK_WORKER_MEMORY=${SPARK_MEM}
NWORKERS=${SLURM_NTASKS:-1} #just for testing you should delete this line
NEXECUTORS=$((SLURM_NTASKS - 1))
# start the workers on each node allocated to the job
export SPARK_NO_DAEMONIZE=1
srun -n ${NWORKERS} -N $SLURM_JOB_NUM_NODES --label --output=$SPARK_LOG_DIR/spark-%j-workers.out start-slave.sh -m ${SPARK_MEM}M -c ${SLURM_CPUS_PER_TASK} ${MASTER_URL} &
## --------------------------------------
## 3. Submit a task to the Spark cluster
## --------------------------------------
spark-submit --master ${MASTER_URL} --total-executor-cores $((SLURM_NTASKS * SLURM_CPUS_PER_TASK)) --executor-memory ${SPARK_WORKER_MEMORY}m --num-executors $((SLURM_NTASKS - 1)) --driver-memory ${SPARK_WORKER_MEMORY}m /project/6008168/moudi/mainold.py
flag_path=$JOB_HOME/master_host
export SPARK_MASTER_IP=$( hostname )
echo "line 81----SPARK_MASTER_IP=$SPARK_MASTER_IP"
MASTER_NODE=$( scontrol show hostname $SLURM_NODELIST | head -n 1 )
MASTER_NODE=$MASTER_NODE.int.cedar.computecanada.ca
MASTER_URL="spark://$MASTER_NODE:$SPARK_MASTER_PORT"
## --------------------------------------
## 4. Clean up
## --------------------------------------
# stop the workers
scancel ${SLURM_JOBID}.0
# stop the master
$SPARK_HOME/sbin/stop-master.sh
The script does not run correctly. I get the following error:

Error occurred during initialization of VM
Too small initial heap

In fact, at the beginning of the master's output file I see:
Spark Command: /cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/java/1.8.0_121/bin/java -cp /home/moudi/.spark/2.3.0/conf/:/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/spark/2.3.0/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host cdr1272.int.cedar.computecanada.ca --port 7077 --webui-port 8080
Because of this -Xmx1g, Spark does not work. Could you help me diagnose why it is 1g? I specified 15g of memory for the master.
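For reference, my understanding of Spark's launch scripts is that `sbin/start-master.sh` sizes the daemon JVM from `SPARK_DAEMON_MEMORY` and falls back to a default (apparently 1g, matching the observed `-Xmx1g`) when that variable is unset at launch time. A minimal sketch of that defaulting behaviour, where `start_master_sketch` is a hypothetical stand-in for the real launch script:

```shell
#!/bin/bash
# Sketch of the heap-defaulting logic assumed to live in Spark's
# spark-daemon.sh: if SPARK_DAEMON_MEMORY is unset when the master
# starts, the daemon JVM gets a 1g default heap.
start_master_sketch() {
    local heap="${SPARK_DAEMON_MEMORY:-1g}"   # fallback when unset
    echo "java -Xmx${heap} org.apache.spark.deploy.master.Master"
}

# Order matters: the value must be exported BEFORE start-master.sh runs;
# exporting it afterwards has no effect on an already-launched JVM.
unset SPARK_DAEMON_MEMORY
start_master_sketch                 # -> java -Xmx1g ...

export SPARK_DAEMON_MEMORY=15360m
start_master_sketch                 # -> java -Xmx15360m ...
```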
In the same file (the master output), I can see 12 workers, each with 5 cores and 15g:
19/05/07 15:11:25 INFO Master: Registering worker 172.16.140.247:46822 with 5 cores, 15.0 GB RAM
19/05/07 15:11:25 INFO Master: Registering worker 172.16.141.2:41554 with 5 cores, 15.0 GB RAM
19/05/07 15:11:25 INFO Master: Registering worker 172.16.141.2:38652 with 5 cores, 15.0 GB RAM
19/05/07 15:11:25 INFO Master: Registering worker 172.16.140.247:35553 with 5 cores, 15.0 GB RAM
19/05/07 15:11:25 INFO Master: Registering worker 172.16.140.247:43477 with 5 cores, 15.0 GB RAM
19/05/07 15:11:25 INFO Master: Registering worker 172.16.140.247:36128 with 5 cores, 15.0 GB RAM
19/05/07 15:11:25 INFO Master: Registering worker 172.16.140.247:35494 with 5 cores, 15.0 GB RAM
19/05/07 15:11:27 INFO Master: Registering worker 172.16.141.2:34899 with 5 cores, 15.0 GB RAM
19/05/07 15:11:27 INFO Master: Registering worker 172.16.140.247:40010 with 5 cores, 15.0 GB RAM
19/05/07 15:11:29 INFO Master: Registering worker 172.16.141.2:37054 with 5 cores, 15.0 GB RAM
19/05/07 15:11:31 INFO Master: Registering worker 172.16.141.2:37322 with 5 cores, 15.0 GB RAM
19/05/07 15:11:33 INFO Master: Registering worker 172.16.141.2:36519 with 5 cores, 15.0 GB RAM
...
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/0 on worker worker-20190507151124-172.16.140.247-36128
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/1 on worker worker-20190507151124-172.16.141.2-38652
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/2 on worker worker-20190507151124-172.16.141.2-41554
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/3 on worker worker-20190507151124-172.16.140.247-43477
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/4 on worker worker-20190507151126-172.16.141.2-34899
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/5 on worker worker-20190507151128-172.16.141.2-37054
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/6 on worker worker-20190507151124-172.16.140.247-35553
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/7 on worker worker-20190507151130-172.16.141.2-37322
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/8 on worker worker-20190507151124-172.16.140.247-46822
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/9 on worker worker-20190507151124-172.16.140.247-35494
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/10 on worker worker-20190507151132-172.16.141.2-36519
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/11 on worker worker-20190507151127-172.16.140.247-40010
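As a sanity check on what the master actually registered, the "Registering worker" lines can be tallied straight from the master log. This awk sketch assumes the exact line format shown above ("... with N cores, M GB RAM"); the sample file path is only for the demo:

```shell
#!/bin/bash
# Tally workers, total cores, and total RAM from "Registering worker"
# lines in a Spark master log.
count_workers() {
    awk '/Registering worker/ {
        workers++
        cores += $(NF-4)   # the number just before "cores,"
        ram   += $(NF-2)   # the number just before "GB"
    }
    END { printf "%d workers, %d cores, %.1f GB\n", workers, cores, ram }' "$1"
}

# Demo on two lines copied from the log above:
cat > /tmp/master_sample.log <<'EOF'
19/05/07 15:11:25 INFO Master: Registering worker 172.16.140.247:46822 with 5 cores, 15.0 GB RAM
19/05/07 15:11:25 INFO Master: Registering worker 172.16.141.2:41554 with 5 cores, 15.0 GB RAM
EOF
count_workers /tmp/master_sample.log   # -> 2 workers, 10 cores, 30.0 GB
```

Running it on the real `$SPARK_LOG_DIR` master log would show whether the registered totals match what the Slurm allocation was supposed to provide.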
Also, in the worker directory I see a subfolder named app-20190507151158-0000. It in turn contains 12 subfolders, 0 through 11, each holding an stderr file that looks like a log file. In each of those files I also noticed:
...
19/05/05 23:49:38 INFO MemoryStore: MemoryStore started with capacity 7.8 GB
I don't know what this means exactly. Does each executor get only 7.8 GB, or 15 GB?
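For context on that 7.8 GB figure: in Spark 2.x the MemoryStore capacity is not the whole executor heap. The UnifiedMemoryManager reserves roughly 300 MB, then gives `spark.memory.fraction` (default 0.6) of the remainder to storage and execution, and the JVM itself reports a max heap somewhat below `-Xmx`. A back-of-the-envelope shell sketch (the 0.89 usable-heap factor is my assumption, not a Spark constant):

```shell
#!/bin/bash
# Rough estimate of Spark 2.x MemoryStore capacity for a given -Xmx (in MB):
#   (usableHeap - 300 MB reserved) * spark.memory.fraction (default 0.6)
# The 89/100 factor approximates Runtime.maxMemory being below -Xmx.
estimate_memstore_mb() {
    local xmx_mb=$1
    local usable_mb=$(( xmx_mb * 89 / 100 ))   # assumed usable heap
    echo $(( (usable_mb - 300) * 6 / 10 ))     # fraction = 0.6
}

estimate_memstore_mb 15360   # -> 8022, i.e. ~7.8 GB for a 15g executor
```

So a "MemoryStore started with capacity 7.8 GB" line is consistent with an executor that really did get the full 15g heap.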