PAI tutorial example fails to run with '[ExitCode]: 177'

Time: 2018-06-19 07:30:50

Tags: tensorflow openpai

I am following the PAI job tutorial.

Here is my job configuration:

{
  "jobName": "yuan_tensorflow-distributed-jobguid",
  "image": "docker.io/openpai/pai.run.tensorflow",
  "dataDir": "hdfs://10.11.3.2:9000/yuan/sample/tensorflow",
  "outputDir": "$PAI_DEFAULT_FS_URI/yuan/tensorflow-distributed-jobguid/output",
  "codeDir": "$PAI_DEFAULT_FS_URI/path/tensorflow-distributed-jobguid/code",
  "virtualCluster": "default",
  "taskRoles": [
    {
      "name": "ps_server",
      "taskNumber": 2,
      "cpuNumber": 2,
      "memoryMB": 8192,
      "gpuNumber": 0,
      "portList": [
        {
          "label": "http",
          "beginAt": 0,
          "portNumber": 1
        },
        {
          "label": "ssh",
          "beginAt": 0,
          "portNumber": 1
        }
      ],
      "command": "pip --quiet install scipy && python code/tf_cnn_benchmarks.py --local_parameter_device=cpu --batch_size=32 --model=resnet20 --variable_update=parameter_server --data_dir=$PAI_DATA_DIR --data_name=cifar10 --train_dir=$PAI_OUTPUT_DIR --ps_hosts=$PAI_TASK_ROLE_ps_server_HOST_LIST --worker_hosts=$PAI_TASK_ROLE_worker_HOST_LIST --job_name=ps --task_index=$PAI_CURRENT_TASK_ROLE_CURRENT_TASK_INDEX"
    },
    {
      "name": "worker",
      "taskNumber": 2,
      "cpuNumber": 2,
      "memoryMB": 16384,
      "gpuNumber": 4,
      "portList": [
        {
          "label": "http",
          "beginAt": 0,
          "portNumber": 1
        },
        {
          "label": "ssh",
          "beginAt": 0,
          "portNumber": 1
        }
      ],
      "command": "pip --quiet install scipy && python code/tf_cnn_benchmarks.py --local_parameter_device=cpu --batch_size=32 --model=resnet20 --variable_update=parameter_server --data_dir=$PAI_DATA_DIR --data_name=cifar10 --train_dir=$PAI_OUTPUT_DIR --ps_hosts=$PAI_TASK_ROLE_ps_server_HOST_LIST --worker_hosts=$PAI_TASK_ROLE_worker_HOST_LIST --job_name=worker --task_index=$PAI_CURRENT_TASK_ROLE_CURRENT_TASK_INDEX"
    }
  ],
  "killAllOnCompletedTaskNumber": 2,
  "retryCount": 0
}

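The job was submitted successfully but failed soon afterwards, after about 4 minutes.

Here is my application summary.

Start time: 6/15/2018, 8:18:01 PM

End time: 6/15/2018, 8:22:31 PM

Exit diagnostics: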

     

[ExitStatus]: LAUNCHER_EXIT_STATUS_UNDEFINED
[ExitCode]: 177
[ExitDiagnostics]: ExitStatus undefined in Launcher, maybe UserApplication itself failed.
[ExitType]: UNKNOWN
________________________________________________________________________________
[ExitCustomizedDiagnostics]:
[ExitCode]: 1
[ExitDiagnostics]:
Exception from container-launch.
Container id: container_1529064439409_0003_01_000005
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
    at org.apache.hadoop.util.Shell.run(Shell.java:456)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Shell output: [ERROR] EXIT signal received in yarn container, exiting ...

Container exited with a non-zero exit code 1
________________________________________________________________________________
[ExitCustomizedDiagnostics]:

worker: TASK_COMPLETED:
[TaskStatus]: {"taskIndex": 1, "taskRoleName": "worker", "taskState": "TASK_COMPLETED", "taskRetryPolicyState": {"retriedCount": 0, "succeededRetriedCount": 0, "transientNormalRetriedCount": 0, "transientConflictRetriedCount": 0, "nonTransientRetriedCount": 0, "unKnownRetriedCount": 0}, "taskCreatedTimestamp": 1529065083290, "taskCompletedTimestamp": 1529065346772, "taskServiceStatus": {"serviceVersion": 0}, "containerId": "container_1529064439409_0003_01_000005", "containerHost": "10.11.1.9", "containerIp": "10.11.1.9", "containerPorts": "http:2938;ssh:2939;", "containerGpus": 15, "containerLogHttpAddress": "http://10.11.1.9:8042/node/containerlogs/container_1529064439409_0003_01_000005/admin/", "containerConnectionLostCount": 0, "containerIsDecommissioning": null, "containerLaunchedTimestamp": 1529065087200, "containerCompletedTimestamp": 1529065346768, "containerExitCode": 1, "containerExitDiagnostics": "Exception from container-launch.\nContainer id: container_1529064439409_0003_01_000005\nExit code: 1\nStack trace: ExitCodeException exitCode=1: \n\tat org.apache.hadoop.util.Shell.runCommand(Shell.java:545)\n\tat org.apache.hadoop.util.Shell.run(Shell.java:456)\n\tat org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)\n\tat org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)\n\tat org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)\n\tat org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\n\nShell output: [ERROR] EXIT signal received in yarn container, exiting ...\n\n\nContainer exited with a non-zero exit code 1\n", "containerExitType": "UNKNOWN"}
[ContainerDiagnostics]: Container completed container_1529064439409_0003_01_000005 on HostName 10.11.1.9.
ContainerLogHttpAddress: http://10.11.1.9:8042/node/containerlogs/container_1529064439409_0003_01_000005/admin/
AppCacheNetworkPath: 10.11.1.9:/var/lib/hadoopdata/nm-local-dir/usercache/admin/appcache/application_1529064439409_0003
ContainerLogNetworkPath: 10.11.1.9:/var/lib/yarn/userlogs/application_1529064439409_0003/container_1529064439409_0003_01_000005
________________________________________________________________________________
[AMStopReason]: Task worker Completed and KillAllOnAnyCompleted enabled.

I found more details in the log:

[INFO] hdfs_ssh_folder is hdfs://10.11.3.2:9000/Container/admin/yuan_tensorflow-distributed-2/ssh/application_1529064439409_0450
[INFO] task_role_no is 0
[INFO] PAI_TASK_INDEX is 1
[INFO] waitting for ssh key ready
[INFO] waitting for ssh key ready
[INFO] ssh key pair ready ...
[INFO] begin to download ssh key pair from hdfs ...
[INFO] start ssh service
 * Restarting OpenBSD Secure Shell server sshd [ OK ]
[INFO] USER COMMAND START

Traceback (most recent call last):
  File "code/tf_cnn_benchmarks.py", line 38, in <module>
    import benchmark_storage
ImportError: No module named benchmark_storage
[DEBUG] EXIT signal received in docker container, exiting ...

Conclusion:

The code is incomplete: it is missing some dependencies. Below is a job configuration that works.

{
  "jobName": "tensorflow-cifar10",
  "image": "openpai/pai.example.tensorflow",

  "dataDir": "/tmp/data",
  "outputDir": "/tmp/output",

  "taskRoles": [
    {
      "name": "cifar_train",
      "taskNumber": 1,
      "cpuNumber": 8,
      "memoryMB": 32768,
      "gpuNumber": 1,
      "command": "git clone https://github.com/tensorflow/models && cd models/research/slim && python download_and_convert_data.py --dataset_name=cifar10 --dataset_dir=$PAI_DATA_DIR && python train_image_classifier.py --batch_size=64 --model_name=inception_v3 --dataset_name=cifar10 --dataset_split_name=train --dataset_dir=$PAI_DATA_DIR --train_dir=$PAI_OUTPUT_DIR"
    }
  ]
}

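2 answers:

Answer 0 (score: 0)

Generally you need to check the logs of all the workers, especially the first container that exited, to see what happened there. Any container exiting causes the Launcher to stop the job early, which is why you see the "EXIT signal received in yarn container" message in the application diagnostics.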
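As a rough illustration of finding the first container that exited, the sketch below sorts task records by their completion timestamp. The field names (taskRoleName, taskIndex, containerExitCode, containerCompletedTimestamp, containerLogHttpAddress) come from the [TaskStatus] JSON in the question; the function name first_exited and the second sample record are made up for the example, and how you collect the records from your cluster is left open.

```python
# Sketch only: choose the task whose container finished earliest.
# The first record mirrors values from the question's diagnostics;
# the second record is hypothetical (a task that has not completed).

def first_exited(task_statuses):
    """Return the completed task with the earliest container exit, or None."""
    completed = [
        t for t in task_statuses
        if t.get("containerCompletedTimestamp") is not None
    ]
    if not completed:
        return None
    return min(completed, key=lambda t: t["containerCompletedTimestamp"])


statuses = [
    {
        "taskRoleName": "worker",
        "taskIndex": 1,
        "containerExitCode": 1,
        "containerCompletedTimestamp": 1529065346768,
        "containerLogHttpAddress": "http://10.11.1.9:8042/node/containerlogs/"
                                   "container_1529064439409_0003_01_000005/admin/",
    },
    {
        "taskRoleName": "worker",
        "taskIndex": 0,
        "containerExitCode": None,
        "containerCompletedTimestamp": None,
        "containerLogHttpAddress": None,
    },
]

culprit = first_exited(statuses)
if culprit is not None:
    # The log address of this container is the first place to look.
    print(culprit["taskRoleName"], culprit["taskIndex"],
          culprit["containerExitCode"], culprit["containerLogHttpAddress"])
```

The container log behind that address is where the user command's own traceback should appear, as opposed to the generic launcher message.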

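Answer 1 (score: 0)

The logs of failed jobs are not deleted. They are moved to HDFS after the job completes.

From your log, the code appears to be missing some files. Please download the whole benchmarks folder, not just one or two files (for example, the cnn benchmarks).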
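One way to catch this kind of gap before submitting the job is to list the modules a script imports and check that each one is either next to the script or installed. The following is a minimal sketch using only the Python standard library; the function name missing_imports is hypothetical, and it assumes you run it with Python 3 on a script that parses under Python 3.

```python
import ast
import importlib.util
from pathlib import Path


def missing_imports(script_path):
    """List top-level modules imported by the script that cannot be found.

    A module counts as found if it sits next to the script (as a .py file
    or a directory) or if the current Python environment can locate it.
    """
    script = Path(script_path)
    tree = ast.parse(script.read_text())

    # Collect the top-level name of every absolute import in the script.
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module.split(".")[0])

    missing = []
    for name in sorted(names):
        local = (script.parent / (name + ".py")).exists() or (script.parent / name).is_dir()
        if local:
            continue
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            found = False
        if not found:
            missing.append(name)
    return missing
```

Calling missing_imports("code/tf_cnn_benchmarks.py") on a partial copy of the benchmarks folder would be expected to include benchmark_storage in its result, matching the ImportError in the log. Note that the check reflects the machine where you run it, not the Docker image the job uses, so installed packages may differ.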