I am running Spark 2.3 jobs on a Kubernetes cluster.
Output of kubectl version:
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.3", GitCommit:"d2835416544f298c919e2ead3be3d0864b52323b", GitTreeState:"clean", BuildDate:"2018-02-09T21:51:06Z", GoVersion:"go1.9.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.3", GitCommit:"f0efb3cb883751c5ffdbe6d515f3cb4fbe7b7acd", GitTreeState:"clean", BuildDate:"2017-11-08T18:27:48Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
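When I run spark-submit on the k8s master, the driver pod gets stuck in the Waiting: PodInitializing state.
This happens when I submit jobs almost in parallel, e.g. 5 jobs at once.
I ran kubectl describe node on the node where the driver pod is running, and the output is below. I do see resources being overcommitted, but I expected the Kubernetes scheduler not to schedule pods onto a node whose resources are overcommitted or that is in the NotReady state. In this case the node is in the Ready state, not NotReady.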
Name: **********
Roles: worker
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/hostname=****
node-role.kubernetes.io/worker=true
Annotations: node.alpha.kubernetes.io/ttl=0
volumes.kubernetes.io/controller-managed-attach-detach=true
Taints: <none>
CreationTimestamp: Tue, 31 Jul 2018 09:59:24 -0400
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
OutOfDisk False Tue, 14 Aug 2018 09:31:20 -0400 Tue, 31 Jul 2018 09:59:24 -0400 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Tue, 14 Aug 2018 09:31:20 -0400 Tue, 31 Jul 2018 09:59:24 -0400 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Tue, 14 Aug 2018 09:31:20 -0400 Tue, 31 Jul 2018 09:59:24 -0400 KubeletHasNoDiskPressure kubelet has no disk pressure
Ready True Tue, 14 Aug 2018 09:31:20 -0400 Sat, 11 Aug 2018 00:41:27 -0400 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: *****
Hostname: ******
Capacity:
cpu: 16
memory: 125827288Ki
pods: 110
Allocatable:
cpu: 16
memory: 125724888Ki
pods: 110
System Info:
Machine ID: *************
System UUID: **************
Boot ID: 1493028d-0a80-4f2f-b0f1-48d9b8910e9f
Kernel Version: 4.4.0-1062-aws
OS Image: Ubuntu 16.04.4 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://Unknown
Kubelet Version: v1.8.3
Kube-Proxy Version: v1.8.3
PodCIDR: ******
ExternalID: **************
Non-terminated Pods: (11 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
--------- ---- ------------ ---------- --------------- -------------
kube-system calico-node-gj5mb 250m (1%) 0 (0%) 0 (0%) 0 (0%)
kube-system kube-proxy-**************************************** 100m (0%) 0 (0%) 0 (0%) 0 (0%)
kube-system prometheus-prometheus-node-exporter-9cntq 100m (0%) 200m (1%) 30Mi (0%) 50Mi (0%)
logging elasticsearch-elasticsearch-data-69df997486-gqcwg 400m (2%) 1 (6%) 8Gi (6%) 16Gi (13%)
logging fluentd-fluentd-elasticsearch-tj7nd 200m (1%) 0 (0%) 612Mi (0%) 0 (0%)
rook rook-agent-6jtzm 0 (0%) 0 (0%) 0 (0%) 0 (0%)
rook rook-ceph-osd-*****-gwb8j 0 (0%) 0 (0%) 0 (0%) 0 (0%)
spark accelerate-test-5-a3bfb8a597e83d459193a183e17f13b5-exec-1 2 (12%) 0 (0%) 10Gi (8%) 12Gi (10%)
spark accelerate-testing-1-8ed0482f3bfb3c0a83da30bb7d433dff-exec-5 2 (12%) 0 (0%) 10Gi (8%) 12Gi (10%)
spark accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-driver 1 (6%) 0 (0%) 2Gi (1%) 2432Mi (1%)
spark accelerate-testing-2-e8bd0607cc693bc8ae25cc6dc300b2c7-driver 1 (6%) 0 (0%) 2Gi (1%) 2432Mi (1%)
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
CPU Requests CPU Limits Memory Requests Memory Limits
------------ ---------- --------------- -------------
7050m (44%) 1200m (7%) 33410Mi (27%) 45874Mi (37%)
Events: <none>
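As a sanity check on the overcommit question, the per-pod figures from the "Non-terminated Pods" table can be summed by hand and compared with the "Allocated resources" line. The sketch below is only that arithmetic: the short pod labels are shorthand I made up for readability, and the values are copied from the output above (CPU in millicores, memory in Mi). Truncating to a whole percent happens to reproduce the percentages kubectl printed.

```python
# (label, cpu_request_m, cpu_limit_m, mem_request_mi, mem_limit_mi)
# Labels are informal shorthand; numbers come from the describe-node table.
PODS = [
    ("calico-node",          250,    0,         0,         0),
    ("kube-proxy",           100,    0,         0,         0),
    ("node-exporter",        100,  200,        30,        50),
    ("elasticsearch-data",   400, 1000,  8 * 1024, 16 * 1024),
    ("fluentd",              200,    0,       612,         0),
    ("rook-agent",             0,    0,         0,         0),
    ("rook-ceph-osd",          0,    0,         0,         0),
    ("spark executor A",    2000,    0, 10 * 1024, 12 * 1024),
    ("spark executor B",    2000,    0, 10 * 1024, 12 * 1024),
    ("spark driver A",      1000,    0,  2 * 1024,      2432),
    ("spark driver B",      1000,    0,  2 * 1024,      2432),
]

# Allocatable values from the same node description.
ALLOCATABLE_CPU_M = 16 * 1000
ALLOCATABLE_MEM_MI = 125724888 / 1024  # reported in Ki


def totals(rows):
    """Sum each numeric column: CPU requests, CPU limits, memory requests, memory limits."""
    return tuple(sum(row[i] for row in rows) for i in range(1, 5))


def percent(value, allocatable):
    """Share of allocatable, truncated to a whole percent."""
    return int(value * 100 / allocatable)


cpu_req, cpu_lim, mem_req, mem_lim = totals(PODS)
print(f"CPU requests:    {cpu_req}m ({percent(cpu_req, ALLOCATABLE_CPU_M)}%)")
print(f"CPU limits:      {cpu_lim}m ({percent(cpu_lim, ALLOCATABLE_CPU_M)}%)")
print(f"Memory requests: {mem_req}Mi ({percent(mem_req, ALLOCATABLE_MEM_MI)}%)")
print(f"Memory limits:   {mem_lim}Mi ({percent(mem_lim, ALLOCATABLE_MEM_MI)}%)")
```

By these numbers, requests come to 44% of allocatable CPU and 27% of allocatable memory, so at least in terms of requests the node has room left.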
kubectl describe pod gives the following:
Name: accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-driver
Namespace: spark
Node: ****
Start Time: Mon, 13 Aug 2018 16:18:34 -0400
Labels: launch-id=k8s-submit-service-cddf45ff-0d88-4681-af85-d8ed0359ce73
spark-app-selector=spark-63f536fd87f8457796802767922ef7d9
spark-role=driver
Annotations: spark-app-name=accelerate-testing-2
Status: Pending
IP:
Init Containers:
spark-init:
Container ID:
Image: ****:v2.3.0
Image ID:
Port: <none>
Args:
init
/etc/spark-init/spark-init.properties
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/etc/spark-init from spark-init-properties (rw)
/var/run/secrets/kubernetes.io/serviceaccount from spark-token-mj86g (ro)
/var/spark-data/spark-files from download-files-volume (rw)
/var/spark-data/spark-jars from download-jars-volume (rw)
Containers:
spark-kubernetes-driver:
Container ID:
Image: ******:v2.3.0
Image ID:
Port: <none>
Args:
driver
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Limits:
memory: 2432Mi
Requests:
cpu: 1
memory: 2Gi
Environment:
SPARK_DRIVER_MEMORY: 2g
SPARK_DRIVER_CLASS: com.myclass
SPARK_DRIVER_BIND_ADDRESS: (v1:status.podIP)
SPARK_MOUNTED_CLASSPATH: /var/spark-data/spark-jars/quantum-workflow-2.2.24.0-SNAPSHOT-assembly.jar:/var/spark-data/spark-jars/my.jar
SPARK_MOUNTED_FILES_DIR: /var/spark-data/spark-files
SPARK_JAVA_OPT_0: -Dspark.kubernetes.container.image=***
SPARK_JAVA_OPT_1: -Dspark.jars=s3a://my/my.jar,s3a://my/my1.jar
SPARK_JAVA_OPT_2: -Dspark.submit.deployMode=cluster
SPARK_JAVA_OPT_3: -Dspark.driver.blockManager.port=7079
SPARK_JAVA_OPT_4: -Dspark.executor.memory=10g
SPARK_JAVA_OPT_5: -Dspark.app.id=spark-63f536fd87f8457796802767922ef7d9
SPARK_JAVA_OPT_6: -Dspark.kubernetes.authenticate.driver.serviceAccountName=spark
SPARK_JAVA_OPT_7: -Dspark.master=k8s://https://kubernetes.default
SPARK_JAVA_OPT_8: -Dspark.driver.host=spark-1534191513364-driver-svc.spark.svc
SPARK_JAVA_OPT_9: -Dspark.executor.cores=2
SPARK_JAVA_OPT_10: -Dspark.kubernetes.executor.podNamePrefix=accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba
SPARK_JAVA_OPT_11: -Dspark.driver.port=7078
SPARK_JAVA_OPT_12: -Dspark.kubernetes.namespace=spark
SPARK_JAVA_OPT_13: -Dspark.executor.memoryOverhead=2g
SPARK_JAVA_OPT_14: -Dspark.kubernetes.initContainer.configMapName=accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-init-config
SPARK_JAVA_OPT_15: -Dspark.kubernetes.initContainer.configMapKey=spark-init.properties
SPARK_JAVA_OPT_16: -Dspark.executor.instances=10
SPARK_JAVA_OPT_17: -Dspark.memory.fraction=0.6
SPARK_JAVA_OPT_18: -Dspark.driver.memory=2g
SPARK_JAVA_OPT_19: -Dspark.kubernetes.driver.pod.name=accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-driver
SPARK_JAVA_OPT_20: -Dspark.app.name=accelerate-testing-2
SPARK_JAVA_OPT_21: -Dspark.kubernetes.driver.label.launch-id=********
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from spark-token-mj86g (ro)
/var/spark-data/spark-files from download-files-volume (rw)
/var/spark-data/spark-jars from download-jars-volume (rw)
Conditions:
Type Status
Initialized False
Ready False
PodScheduled True
Volumes:
spark-init-properties:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-init-config
Optional: false
download-jars-volume:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
download-files-volume:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
spark-token-mj86g:
Type: Secret (a volume populated by a Secret)
SecretName: spark-token-mj86g
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SandboxChanged 44m (x518 over 18h) kubelet, **************************** Pod sandbox changed, it will be killed and re-created.
Warning FailedSync 19s (x540 over 18h) kubelet, **************************** Error syncing pod
I also tried kubectl top nodes, and none of the nodes is overcommitted on resources.
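To find every driver stuck like this without describing pods one at a time, a small script along these lines might help. It is a rough sketch, not a fix: the function names are my own, it assumes kubectl is on the PATH and that the pods live in the spark namespace as above, and it cannot tell a pod that is genuinely stuck from one that only just started initializing.

```python
import json
import subprocess


def stuck_initializing(pod_list):
    """Return names of Pending pods that have a container waiting on PodInitializing.

    pod_list is the parsed JSON from `kubectl get pods -o json`.
    """
    stuck = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        if status.get("phase") != "Pending":
            continue
        # Check both init containers and regular containers.
        statuses = (status.get("initContainerStatuses") or []) + (
            status.get("containerStatuses") or []
        )
        waiting_reasons = [
            (s.get("state", {}).get("waiting") or {}).get("reason") for s in statuses
        ]
        if "PodInitializing" in waiting_reasons:
            stuck.append(pod["metadata"]["name"])
    return stuck


def list_pods(namespace="spark"):
    """Fetch all pods in a namespace as parsed JSON (requires kubectl access)."""
    out = subprocess.check_output(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"]
    )
    return json.loads(out)
```

Calling stuck_initializing(list_pods()) would then return the names of the affected pods, which could be compared against their start times to see how long each has been waiting.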