Kubeflow pipeline fails to pick up metrics

Time: 2020-09-03 04:46:23

Tags: kubeflow

I get the log below. In my training code, I save the accuracy to the path /accuracy.json and save a metrics file containing that accuracy to /mlpipeline-metrics.json. The JSON files are created correctly, but the Kubeflow pipeline (or Argo, which produces the log above it) does not seem to pick them up.

│ wait time="2020-09-03T04:07:19Z" level=info msg="Copying /mlpipeline-metrics.json from container base image layer to /argo/outputs/artifacts/mlpipeline-metrics.tgz"
│ wait time="2020-09-03T04:07:19Z" level=info msg="Archiving :/mlpipeline-metrics.json to /argo/outputs/artifacts/mlpipeline-metrics.tgz"
│ wait time="2020-09-03T04:07:19Z" level=info msg="sh -c docker cp -a :/mlpipeline-metrics.json - | gzip > /argo/outputs/artifacts/mlpipeline-metrics.tgz"
│ wait time="2020-09-03T04:07:19Z" level=warning msg="path /mlpipeline-metrics.json does not exist (or /mlpipeline-metrics.json is empty) in archive /argo/outputs/artifacts/mlpipeline-metri
│ cs.tgz"
│ wait time="2020-09-03T04:07:19Z" level=warning msg="Ignoring optional artifact 'mlpipeline-metrics' which does not exist in path '/mlpipeline-metrics.json': path /mlpipeline-metrics.json
│ does not exist (or /mlpipeline-metrics.json is empty) in archive /argo/outputs/artifacts/mlpipeline-metrics.tgz"
│ wait time="2020-09-03T04:07:19Z" level=info msg="Staging artifact: transformer-pytorch-train-job-acc"
│ wait time="2020-09-03T04:07:19Z" level=info msg="Copying /accuracy.json from container base image layer to /argo/outputs/artifacts/transformer-pytorch-train-job-acc.tgz"
│ wait time="2020-09-03T04:07:19Z" level=info msg="Archiving :/accuracy.json to /argo/outputs/artifacts/transformer-pytorch-train-job-acc.tgz"
│ wait time="2020-09-03T04:07:19Z" level=info msg="sh -c docker cp -a :/accuracy.json - | gzip > /argo/outputs/artifacts/transformer-pytorch-train-job-acc.tgz"
│ wait time="2020-09-03T04:07:19Z" level=warning msg="path /accuracy.json does not exist (or /accuracy.json is empty) in archive /argo/outputs/artifacts/transformer-pytorch-train-job-acc.tg
│ z"
│ wait time="2020-09-03T04:07:19Z" level=error msg="executor error: path /accuracy.json does not exist (or /accuracy.json is empty) in archive /argo/outputs/artifacts/transformer-pytorch-tr
│ ain-job-acc.tgz\ngithub.com/argoproj/argo/errors.New\n\t/go/src/github.com/argoproj/argo/errors/errors.go:49\ngithub.com/argoproj/argo/errors.Errorf\n\t/go/src/github.com/argoproj/argo/er
│ rors/errors.go:55\ngithub.com/argoproj/argo/workflow/executor/docker.(*DockerExecutor).CopyFile\n\t/go/src/github.com/argoproj/argo/workflow/executor/docker/docker.go:66\ngithub.com/argop
│ roj/argo/workflow/executor.(*WorkflowExecutor).stageArchiveFile\n\t/go/src/github.com/argoproj/argo/workflow/executor/executor.go:344\ngithub.com/argoproj/argo/workflow/executor.(*Workflo
│ wExecutor).saveArtifact\n\t/go/src/github.com/argoproj/argo/workflow/executor/executor.go:245\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).SaveArtifacts\n\t/go/src/gith
│ ub.com/argoproj/argo/workflow/executor/executor.go:231\ngithub.com/argoproj/argo/cmd/argoexec/commands.waitContainer\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:54\n
│ github.com/argoproj/argo/cmd/argoexec/commands.NewWaitCommand.func1\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:16\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/
│ src/github.com/spf13/cobra/command.go:766\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/src/github.com/spf13/cobra/command.go:852\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/sr
│ c/github.com/spf13/cobra/command.go:800\nmain.main\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/main.go:17\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:201\nruntime.goexit\n\t/u
│ sr/local/go/src/runtime/asm_amd64.s:1333"
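For context, the end of my training code writes the two files roughly as in the minimal sketch below. This is an illustration rather than the real training code: the accuracy value and the metric name are placeholders, and the /mlpipeline-metrics.json structure follows the metrics schema the Kubeflow Pipelines UI renders.

import json

accuracy = 0.95  # placeholder: value computed by the training loop

# Raw artifact collected via file_outputs['acc'].
with open('/accuracy.json', 'w') as f:
    json.dump(accuracy, f)

# Metrics file in the schema the Kubeflow Pipelines UI expects:
# a top-level "metrics" list of {name, numberValue, format} entries.
metrics = {
    'metrics': [{
        'name': 'accuracy-score',   # placeholder metric name
        'numberValue': accuracy,
        'format': 'PERCENTAGE',
    }]
}
with open('/mlpipeline-metrics.json', 'w') as f:
    json.dump(metrics, f)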

The pipeline code I am using is below. If I understand correctly, the container saves the metrics at the JSON file paths I specified, Argo then picks those files up, and the output is rendered in my Kubeflow UI. The log above therefore confuses me. Any ideas or suggestions would help me a lot.

import kfp
from kfp import dsl


@dsl.pipeline(
    name="PyTorch Job",
    description="Example Tutorial"
)
def containerop_basic():
    op = dsl.ContainerOp(
        name='pytorch-train-job',
        image='From our ECR',  # private training image
        # Paths inside the container that KFP/Argo should collect as outputs.
        file_outputs={
            'acc': '/accuracy.json',
            'mlpipeline-metrics': '/mlpipeline-metrics.json'
        }
    )


if __name__ == '__main__':
    kfp.compiler.Compiler().compile(containerop_basic, __file__ + '.yaml')

2 Answers:

Answer 0 (score: 1):

I solved the problem. It was an Argo authorization issue. While executing the pipeline, Argo needs a role that allows it to "watch" pods. Adding that role to the service account being used fixed the issue.
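For anyone hitting the same error, below is a hedged sketch of what granting that permission can look like, written with the official kubernetes Python client so it stays in the same language as the pipeline code. The namespace 'kubeflow', the service-account name 'pipeline-runner', the role names, and the exact verb list are assumptions about a typical installation; adjust them to yours, or apply an equivalent Role/RoleBinding manifest with kubectl instead.

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run in-cluster
rbac = client.RbacAuthorizationV1Api()

namespace = 'kubeflow'                 # assumption: namespace the pipeline runs in
service_account = 'pipeline-runner'    # assumption: service account used by the run

# Role letting Argo's wait sidecar read and watch pods in the namespace.
rbac.create_namespaced_role(namespace, {
    'apiVersion': 'rbac.authorization.k8s.io/v1',
    'kind': 'Role',
    'metadata': {'name': 'argo-pod-access'},
    'rules': [{
        'apiGroups': [''],
        'resources': ['pods', 'pods/log'],
        'verbs': ['get', 'list', 'watch', 'patch'],
    }],
})

# Bind the role to the service account the pipeline pods run as.
rbac.create_namespaced_role_binding(namespace, {
    'apiVersion': 'rbac.authorization.k8s.io/v1',
    'kind': 'RoleBinding',
    'metadata': {'name': 'argo-pod-access-binding'},
    'roleRef': {
        'apiGroup': 'rbac.authorization.k8s.io',
        'kind': 'Role',
        'name': 'argo-pod-access',
    },
    'subjects': [{
        'kind': 'ServiceAccount',
        'name': service_account,
        'namespace': namespace,
    }],
})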

Answer 1 (score: 0):

When you specify the file_outputs={'kfp_reference_name': 'file_location'} dictionary, you are essentially telling KFP that, when the container run ends, it should look for a file at file_location and copy it to a new location where other steps of the pipeline can access it via kfp_reference_name (I won't go into the details, but this is basically done through the MinIO server deployed during the Kubeflow installation).
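As a hedged illustration (the second step's image and command are made up for this example, and the snippet is assumed to live inside the same pipeline function where op is defined), another step can then consume the staged 'acc' value through op.outputs['acc']:

# Inside containerop_basic(), after op is defined: pass the staged
# accuracy value to a follow-up step via the 'acc' output reference.
print_op = dsl.ContainerOp(
    name='print-accuracy',
    image='python:3.7',
    command=['sh', '-c'],
    arguments=['echo "accuracy was: %s"' % op.outputs['acc']],
)

When the pipeline runs, KFP substitutes op.outputs['acc'] with the content of the staged file, so the second step receives the accuracy as a plain string argument.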

From your logs, it looks like the problem is that when KFP looks for the local file inside your container, the file is not available at the specified location, which means your issue is probably one of the following two:

  1. Your container saves the files to a different location. For example, it might save them in the same folder as your code; assuming that folder is /src, changing the code to the following would do it:
file_outputs={
    'acc': '/src/accuracy.json',
    'mlpipeline-metrics': '/src/mlpipeline-metrics.json'
}
  2. Your container does not save the files at all, which means there is a problem somewhere in your code or Dockerfile configuration.

In general, I would also recommend working through Kubeflow's data passing tutorial, one of the best resources on this topic: https://github.com/kubeflow/pipelines/blob/master/samples/tutorials/Data%20passing%20in%20python%20components.ipynb