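I'm fairly new to Kubernetes and TensorFlow, and I'm trying to run the basic Kubeflow distributed TensorFlow example from this link (https://github.com/learnk8s/distributed-tensorflow-on-k8s). I'm currently running a local bare-metal Kubernetes cluster with 2 nodes (1 master and 1 worker). When I run the example in minikube, everything works as documented: both training and serving run successfully. Running the job on the local cluster, however, fails with the error described below.
Any help would be much appreciated.
For this setup, I created a pod that provides NFS storage for the job. Since dynamic provisioning is not enabled on the local cluster, I created the persistent volume manually (the files I used are attached).
NFS storage pod file: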
kind: Service
apiVersion: v1
metadata:
  name: nfs-service
spec:
  selector:
    role: nfs-service
  ports:
    # Open the ports required by the NFS server
    - name: nfs
      port: 2049
    - name: mountd
      port: 20048
    - name: rpcbind
      port: 111
---
kind: Pod
apiVersion: v1
metadata:
  name: nfs-server-pod
  labels:
    role: nfs-service
spec:
  containers:
    - name: nfs-server-container
      image: cpuguy83/nfs-server
      securityContext:
        privileged: true
      args:
        # Pass the paths to share to the Docker image
        - /exports
Persistent volume and PVC file:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs
spec:
  storageClassName: "standard"
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: 10.96.72.11
    path: "/"
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: nfs
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: "standard"
  resources:
    requests:
      storage: 10Gi
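As a rough sanity check on a manually created volume, the sketch below compares the three fields a claim and a volume have to agree on for static binding: storage class, access modes, and capacity. It is a simplification under stated assumptions, not what the Kubernetes controller actually runs: the helper names `to_bytes` and `can_bind` are hypothetical, only whole-number quantities with binary suffixes are parsed, and selectors, volume modes, and default storage classes are ignored.

```python
# Hypothetical helpers; a simplified view of static PV/PVC matching.
UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def to_bytes(quantity):
    """Parse a quantity such as '10Gi' (integer values with binary suffixes only)."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def can_bind(pv, pvc):
    """Return True if the claim's class, access modes and size fit the volume."""
    pv_spec, pvc_spec = pv["spec"], pvc["spec"]
    same_class = pv_spec.get("storageClassName", "") == pvc_spec.get("storageClassName", "")
    modes_ok = set(pvc_spec["accessModes"]) <= set(pv_spec["accessModes"])
    requested = to_bytes(pvc_spec["resources"]["requests"]["storage"])
    big_enough = to_bytes(pv_spec["capacity"]["storage"]) >= requested
    return same_class and modes_ok and big_enough


# The relevant fields of the two manifests above, written as plain dicts.
pv = {"spec": {"storageClassName": "standard",
               "capacity": {"storage": "10Gi"},
               "accessModes": ["ReadWriteMany"]}}
pvc = {"spec": {"storageClassName": "standard",
                "accessModes": ["ReadWriteMany"],
                "resources": {"requests": {"storage": "10Gi"}}}}

print(can_bind(pv, pvc))
```

For the manifests above, all three conditions hold, so a mismatch in these fields is unlikely to be what keeps the job from starting.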
TFJob file:
apiVersion: kubeflow.org/v1beta1
kind: TFJob
metadata:
  name: tfjob1
spec:
  replicaSpecs:
    - replicas: 1
      tfReplicaType: MASTER
      template:
        spec:
          volumes:
            - name: nfs-volume
              persistentVolumeClaim:
                claimName: nfs
          containers:
            - name: tensorflow
              image: learnk8s/mnist:1.0.0
              imagePullPolicy: IfNotPresent
              args:
                - --model_dir
                - ./out/vars
                - --export_dir
                - ./out/models
              volumeMounts:
                - mountPath: /app/out
                  name: nfs-volume
          restartPolicy: OnFailure
    - replicas: 2
      tfReplicaType: WORKER
      template:
        spec:
          containers:
            - name: tensorflow
              image: learnk8s/mnist:1.0.0
              imagePullPolicy: IfNotPresent
          restartPolicy: OnFailure
    - replicas: 1
      tfReplicaType: PS
      template:
        spec:
          volumes:
            - name: nfs-volume
              persistentVolumeClaim:
                claimName: nfs
          containers:
            - name: tensorflow
              image: learnk8s/mnist:1.0.0
              imagePullPolicy: IfNotPresent
              volumeMounts:
                - mountPath: /app/out
                  name: nfs-volume
          restartPolicy: OnFailure
When I run the job, it gives me this error:
error: unable to recognize "kube/tfjob.yaml": no matches for kind "TFJob" in version "kubeflow.org/v1alpha1"
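After some searching, I found someone pointing out that "v1alpha1" may be outdated and that "v1beta1" should be used instead (strangely, "v1alpha1" works with my minikube setup, so I'm confused!). However, although the TFJob is now created, I don't see any new containers starting, in contrast to the minikube run, where new containers start and finish successfully. When I describe the TFJob, I see this error: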
Type     Reason            Age   From         Message
----     ------            ----  ----         -------
Warning  InvalidTFJobSpec  22s   tf-operator  Failed to marshal the object to TFJob; the spec is invalid: failed to marshal the object to TFJob"
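Since the only difference is the NFS storage, I suspect something is wrong with my manual setup. If I messed up somewhere, please let me know, as I don't have enough background knowledge!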
Answer 0 (score: 0)
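I found the problem causing this particular error. First, the API version had changed, so I had to move from v1alpha1 to v1beta2. Second, the tutorial I was following uses Kubeflow v0.1.2 (fairly old), and the syntax for defining a TFJob in the YAML file has changed since then (I'm not sure in which version the change happened). By looking at the latest examples in the git repository, I was able to update the job spec. Here are the files for anyone interested!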
Tutorial version:
apiVersion: kubeflow.org/v1alpha1
kind: TFJob
metadata:
  name: tfjob1
spec:
  replicaSpecs:
    - replicas: 1
      tfReplicaType: MASTER
      template:
        spec:
          volumes:
            - name: nfs-volume
              persistentVolumeClaim:
                claimName: nfs
          containers:
            - name: tensorflow
              image: learnk8s/mnist:1.0.0
              imagePullPolicy: IfNotPresent
              args:
                - --model_dir
                - ./out/vars
                - --export_dir
                - ./out/models
              volumeMounts:
                - mountPath: /app/out
                  name: nfs-volume
          restartPolicy: OnFailure
    - replicas: 2
      tfReplicaType: WORKER
      template:
        spec:
          containers:
            - name: tensorflow
              image: learnk8s/mnist:1.0.0
              imagePullPolicy: IfNotPresent
          restartPolicy: OnFailure
    - replicas: 1
      tfReplicaType: PS
      template:
        spec:
          volumes:
            - name: nfs-volume
              persistentVolumeClaim:
                claimName: nfs
          containers:
            - name: tensorflow
              image: learnk8s/mnist:1.0.0
              imagePullPolicy: IfNotPresent
              volumeMounts:
                - mountPath: /app/out
                  name: nfs-volume
          restartPolicy: OnFailure
Updated version:
apiVersion: kubeflow.org/v1beta2
kind: TFJob
metadata:
  name: tfjob1
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      template:
        spec:
          volumes:
            - name: nfs-volume
              persistentVolumeClaim:
                claimName: nfs
          containers:
            - name: tensorflow
              image: learnk8s/mnist:1.0.0
              imagePullPolicy: IfNotPresent
              args:
                - --model_dir
                - ./out/vars
                - --export_dir
                - ./out/models
              volumeMounts:
                - mountPath: /app/out
                  name: nfs-volume
          restartPolicy: OnFailure
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: tensorflow
              image: learnk8s/mnist:1.0.0
              imagePullPolicy: IfNotPresent
          restartPolicy: OnFailure
    PS:
      replicas: 1
      template:
        spec:
          volumes:
            - name: nfs-volume
              persistentVolumeClaim:
                claimName: nfs
          containers:
            - name: tensorflow
              image: learnk8s/mnist:1.0.0
              imagePullPolicy: IfNotPresent
              volumeMounts:
                - mountPath: /app/out
                  name: nfs-volume
          restartPolicy: OnFailure
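The structural difference between the two files above is that the `replicaSpecs` list, where each entry carries a `tfReplicaType`, becomes a `tfReplicaSpecs` map keyed by replica role. A minimal sketch of that rewrite on plain Python dicts follows. The function name `convert_tfjob` and the `TYPE_TO_KEY` mapping are hypothetical and inferred only from these two manifests; this is not an official migration tool, and other fields may also have changed between versions.

```python
import copy

# Assumed mapping from old tfReplicaType values to new tfReplicaSpecs keys,
# read off the two manifests above.
TYPE_TO_KEY = {"MASTER": "Chief", "WORKER": "Worker", "PS": "PS"}


def convert_tfjob(old_job, new_api_version="kubeflow.org/v1beta2"):
    """Return a copy of old_job with the replicaSpecs list turned into a tfReplicaSpecs map."""
    new_job = copy.deepcopy(old_job)  # leave the caller's dict untouched
    new_job["apiVersion"] = new_api_version
    old_specs = new_job["spec"].pop("replicaSpecs")
    new_specs = {}
    for replica in old_specs:
        replica_type = replica.pop("tfReplicaType")
        new_specs[TYPE_TO_KEY[replica_type]] = replica
    new_job["spec"]["tfReplicaSpecs"] = new_specs
    return new_job


# A trimmed-down version of the tutorial manifest, with pod templates shortened.
container = {"name": "tensorflow", "image": "learnk8s/mnist:1.0.0"}
old_job = {
    "apiVersion": "kubeflow.org/v1alpha1",
    "kind": "TFJob",
    "metadata": {"name": "tfjob1"},
    "spec": {
        "replicaSpecs": [
            {"replicas": 1, "tfReplicaType": "MASTER",
             "template": {"spec": {"containers": [container]}}},
            {"replicas": 2, "tfReplicaType": "WORKER",
             "template": {"spec": {"containers": [container]}}},
            {"replicas": 1, "tfReplicaType": "PS",
             "template": {"spec": {"containers": [container]}}},
        ]
    },
}

new_job = convert_tfjob(old_job)
print(list(new_job["spec"]["tfReplicaSpecs"]))  # ['Chief', 'Worker', 'PS']
```

The pod templates are carried over unchanged; only the wrapper around them moves from list entries to named keys.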