Spark checkpoint directory with spark-submit

Date: 2019-03-30 16:34:34

Tags: apache-spark kubernetes

I'm running Spark with spark-submit.sh on a Kubernetes cluster. I need to checkpoint my data, but it keeps failing with this error:

2019-03-30 16:23:41 WARN  ReliableCheckpointRDD:66 - Error writing partitioner VertexPartitioner(30,ProblemTree(192)) to file:/tmp/checkpoints/2d0f0d68-e8bc-4209-a856-31016478faa0/rdd-18
2019-03-30 16:23:41 INFO  ReliableCheckpointRDD:54 - Checkpointing took 613 ms.
2019-03-30 16:23:41 INFO  MemoryStore:54 - Block broadcast_10 stored as values in memory (estimated size 236.7 KB, free 2.1 GB)
2019-03-30 16:23:41 INFO  MemoryStore:54 - Block broadcast_10_piece0 stored as bytes in memory (estimated size 22.9 KB, free 2.1 GB)
2019-03-30 16:23:41 INFO  BlockManagerInfo:54 - Added broadcast_10_piece0 in memory on iga-adi-graph-1553962941326-driver-svc.default.svc:7079 (size: 22.9 KB, free: 2.1 GB)
2019-03-30 16:23:41 INFO  SparkContext:54 - Created broadcast 10 from reduce at VertexRDDImpl.scala:90
Exception in thread "main" org.apache.spark.SparkException: Checkpoint RDD has a different number of partitions from original RDD. Original RDD [ID: 18, num of partitions: 30]; Checkpoint RDD [ID: 40, num of partitions: 0].
    at org.apache.spark.rdd.ReliableCheckpointRDD$.writeRDDToCheckpointDirectory(ReliableCheckpointRDD.scala:154)
    at org.apache.spark.rdd.ReliableRDDCheckpointData.doCheckpoint(ReliableRDDCheckpointData.scala:58)
    at org.apache.spark.rdd.RDDCheckpointData.checkpoint(RDDCheckpointData.scala:75)
    at org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply$mcV$sp(RDD.scala:1766)
    at org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1756)
    at org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1756)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDD.doCheckpoint(RDD.scala:1755)
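
The exception comes from a sanity check in ReliableCheckpointRDD.writeRDDToCheckpointDirectory, which (paraphrasing the Spark 2.x sources, not quoting them verbatim) does roughly the following after the executors have written their part files:

// Paraphrase of the check in ReliableCheckpointRDD.writeRDDToCheckpointDirectory.
// newRDD.partitions is computed by listing the part-* files in the checkpoint
// directory, so "num of partitions: 0" means that directory looks empty from
// where the check runs.
val newRDD = new ReliableCheckpointRDD[T](sc, checkpointDirPath.toString, originalRDD.partitioner)
if (newRDD.partitions.length != originalRDD.partitions.length) {
  throw new SparkException(
    s"Checkpoint RDD has a different number of partitions from original RDD. " +
      s"Original RDD [ID: ${originalRDD.id}, num of partitions: ${originalRDD.partitions.length}]; " +
      s"Checkpoint RDD [ID: ${newRDD.id}, num of partitions: ${newRDD.partitions.length}].")
}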

Earlier in the log, this warning had already appeared:

Spark is not running in local mode, therefore the checkpoint directory must not be on the local filesystem. Directory '/tmp/checkpoints' appears to be on the local filesystem.
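
For reference, GraphX specifics aside, the checkpointing flow in the job boils down to something like this (a simplified sketch, not the actual IgaAdiPregelSolver code):

import org.apache.spark.{SparkConf, SparkContext}

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("iga-adi-graph"))

    // With a non-local master, setCheckpointDir emits the warning above for
    // any directory that does not carry a non-local scheme such as hdfs://.
    sc.setCheckpointDir("/tmp/checkpoints")

    val rdd = sc.parallelize(1 to 1000, numSlices = 30)
    rdd.checkpoint() // only marks the RDD for checkpointing

    rdd.count() // the checkpoint files are written during this first action
  }
}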

These are the options I'm using:

bin/spark-submit \
    --master k8s://http://localhost:8001 \
    --deploy-mode cluster \
    --name iga-adi-graph \
    --driver-cores 3 \
    --driver-memory 5G \
    --executor-cores 3 \
    --executor-memory 6G \
    --conf spark.executor.instances=10 \
    --conf spark.default.parallelism=30 \
    --conf spark.kubernetes.executor.request.cores=3000m \
    --conf spark.kubernetes.executor.limit.cores=3000m \
    --conf spark.kubernetes.memoryOverheadFactor=0.2 \
    --conf spark.kubernetes.container.image.pullPolicy=Always \
    --conf spark.kubernetes.container.image=kbhit/iga-adi-pregel \
    --conf spark.scheduler.minRegisteredResourcesRatio=1.0 \
    --conf spark.scheduler.maxRegisteredResourcesWaitingTime=300s \
    --files /opt/metrics.properties \
    --conf spark.metrics.conf=/opt/metrics.properties \
    --jars /opt/metrics-influxdb.jar,/opt/spark-influx-sink.jar \
    --conf spark.driver.extraClassPath=spark-influx-sink.jar:metrics-influxdb.jar \
    --conf spark.executor.extraClassPath=/opt/spark-influx-sink.jar:/opt/metrics-influxdb.jar \
    --conf spark.executor.extraJavaOptions="" \
    --conf spark.driver.extraJavaOptions="-Dproblem.size=192 -Dproblem.steps=1" \
    --conf spark.kryo.unsafe=true \
    --conf spark.kryoserializer.buffer=32m \
    --conf spark.network.timeout=360s \
    --conf spark.memory.fraction=0.5 \
    --conf spark.cleaner.periodicGC.interval=10s \
    --conf spark.locality.wait.node=0 \
    --conf spark.locality.wait=9999999 \
    --conf spark.kubernetes.executor.volumes.emptyDir.mycheckpoints.mount.path=/tmp/checkpoints \
    --conf spark.kubernetes.executor.volumes.emptyDir.mycheckpoints.mount.readOnly=false \
    --conf spark.kubernetes.driver.volumes.emptyDir.mycheckpoints.mount.path=/tmp/checkpoints \
    --conf spark.kubernetes.driver.volumes.emptyDir.mycheckpoints.mount.readOnly=false \
    --class edu.agh.kboom.iga.adi.graph.IgaAdiPregelSolver \
    local:///opt/iga-adi-pregel.jar &

The Spark Kubernetes Manual mentions several options for mounting Kubernetes volumes, which suggests that using one of them for checkpointing should be perfectly fine:

spark.kubernetes.executor.volumes.emptyDir.mycheckpoints.mount.path

Is it possible to use a Kubernetes volume mount (emptyDir) for checkpointing?

0 Answers:
