I want to run a set of Spark jobs in parallel via YARN, then wait until all of them have finished before launching another set. How can I tell when the first set of jobs has completed? Thanks.
Answer (score: 1)
An example workaround:
Give each of your Spark jobs a unique name in the spark-submit command:
spark-submit --master yarn-cluster --name spark_job_name job1.jar
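One way to run the first set in parallel and block until it finishes: in yarn-cluster mode, spark-submit itself waits for the application to complete by default (`spark.yarn.submit.waitAppCompletion=true`), so you can background each submission and use `wait`. A minimal sketch; the helper name, job names, and jar paths are placeholders:

```shell
# Hypothetical helper: submit each named job in the background, then
# block until every background spark-submit process has exited.
# Because spark-submit in yarn-cluster mode polls until the application
# finishes (default spark.yarn.submit.waitAppCompletion=true), `wait`
# here pauses the script until the whole set is done.
submit_set() {
  for name in "$@"; do
    spark-submit --master yarn-cluster --name "$name" job1.jar &
  done
  wait  # returns once all background spark-submit processes exit
}

# Usage: submit the first set, then the second only after it completes.
# submit_set job_a job_b job_c
# spark-submit --master yarn-cluster --name job_d job2.jar
```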
Then check YARN to see whether the Spark job is still running; if it is not, submit the second job. Bash script below:
JOB="spark_job_name"
# Look up the application ID of a RUNNING YARN application with that name
applicationId=$(yarn application -list -appStates RUNNING | awk -v tmpJob="$JOB" '{ if ($2 == tmpJob) print $1 }')

if [ -n "$applicationId" ]; then
    echo "JOB: ${JOB} is already running. ApplicationId: ${applicationId}"
else
    echo "First job ${JOB} is not running. Starting the next Spark job."
    spark-submit --master yarn-cluster --name spark_job_name2 job2.jar
fi
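Since the question asks about waiting for a whole set of jobs, the same check can be extended into a polling loop over several job names. A sketch under assumptions (the job names, the 30-second interval, and the commented-out follow-up submit are placeholders):

```shell
#!/usr/bin/env bash
# Hypothetical polling loop: block until none of the first set's named
# jobs still appears as RUNNING on YARN, then submit the next set.
JOBS=("spark_job_name" "spark_job_name_b")

jobs_running() {
  # Count how many of our named jobs are in YARN's RUNNING list.
  local running count=0
  # `|| true` tolerates yarn being temporarily unreachable
  running=$(yarn application -list -appStates RUNNING 2>/dev/null || true)
  for job in "${JOBS[@]}"; do
    if echo "$running" | awk -v j="$job" '$2 == j { found = 1 } END { exit !found }'; then
      count=$((count + 1))
    fi
  done
  echo "$count"
}

while [ "$(jobs_running)" -gt 0 ]; do
  echo "Still waiting on $(jobs_running) job(s)..."
  sleep 30
done
echo "First set finished; submitting the second set."
# spark-submit --master yarn-cluster --name spark_job_name2 job2.jar
```

Note that this matches on the application name column (`$2`) of `yarn application -list`, so the names passed to `--name` must be unique across your jobs.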