I first ran the job with spark-shell and then ran the same job with spark-submit, but the spark-submit run takes noticeably longer. I am running in client mode on a 16-node cluster (> 180 vCores).
spark-submit conf:
spark-submit --class tool \
--master yarn \
--deploy-mode client \
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
--conf "spark.kryo.classesToRegister=com.fastdtw.timeseries.TimeSeriesBase" \
--executor-memory 14g \
--driver-memory 16g \
--conf "spark.driver.maxResultSize=16g" \
--conf "spark.kryoserializer.buffer.max=512" \
--num-executors 30 \
--conf "spark.executor.cores=6" \
/home/target/scala-2.10/tool_2.10-0.1-SNAPSHOT.jar
spark-shell conf:
spark-shell \
--master yarn \
--deploy-mode client \
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
--conf "spark.kryo.classesToRegister=com.fastdtw.timeseries.TimeSeriesBase" \
--executor-memory 12g \
--driver-memory 16g \
--conf "spark.driver.maxResultSize=16g" \
--conf "spark.kryoserializer.buffer.max=512" \
--conf "spark.executor.cores=6" \
--conf "spark.executor.instances=30"
Why do the run times differ?
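To rule out configuration drift between the two launches, one option is to dump the effective settings from each run and diff them. Below is a minimal sketch; the helper name dumpConf is hypothetical, and it only uses the standard SparkContext/SparkConf API (in spark-shell the context is already bound to sc, in the submitted job you would call it on the SparkContext created by the application):

// Print the effective Spark configuration, sorted by key, so the
// spark-shell and spark-submit runs can be compared line by line.
def dumpConf(sc: org.apache.spark.SparkContext): Unit = {
  sc.getConf.getAll
    .sortBy(_._1)
    .foreach { case (key, value) => println(s"$key=$value") }
}

// Usage: dumpConf(sc)

Saving the output of both runs and diffing them shows whether settings such as executor memory or the number of executors actually end up the same at runtime.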