Unable to train in Google Cloud ML

Asked: 2018-10-15 18:03:16

Tags: tensorflow google-cloud-ml

I am unable to train with ML Engine: training always stops around iteration 60. I build the model layers with Keras, but I run the training loop myself with a tf.Session.
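To make the setup concrete, the pattern in question — Keras used only to define the graph, with training driven manually through a tf.Session — looks roughly like this minimal sketch. The layer sizes, loss, and random data here are placeholders for illustration, not the poster's actual PersonLab model; the job ran on runtime 1.10, where the `tf.compat.v1` aliasing below is unnecessary.

```python
import numpy as np
import tensorflow as tf

tf1 = tf.compat.v1            # on runtime 1.10 plain `tf` suffices; this keeps the sketch runnable on TF 2.x
tf1.disable_eager_execution()

# Placeholder model: Keras layers build the graph; nothing else from Keras is used.
inputs = tf1.placeholder(tf.float32, shape=(None, 4))
labels = tf1.placeholder(tf.float32, shape=(None, 1))
hidden = tf.keras.layers.Dense(8, activation="relu")(inputs)
outputs = tf.keras.layers.Dense(1)(hidden)

# Training is driven manually through a Session, as described in the question.
loss = tf1.losses.mean_squared_error(labels, outputs)
train_op = tf1.train.AdamOptimizer(1e-3).minimize(loss)

with tf1.Session() as sess:
    sess.run(tf1.global_variables_initializer())
    x = np.random.rand(32, 4).astype(np.float32)
    y = np.random.rand(32, 1).astype(np.float32)
    for i in range(61):
        _, loss_val = sess.run([train_op, loss], feed_dict={inputs: x, labels: y})
        if i % 10 == 0:
            print("iteration:", i)
```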

I get this error, with no traceback:

ERROR   2018-10-15 10:31:02 -0700   master-replica-0        name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
ERROR   2018-10-15 10:31:02 -0700   master-replica-0        pciBusID: 0000:00:04.0
ERROR   2018-10-15 10:31:02 -0700   master-replica-0        totalMemory: 15.90GiB freeMemory: 15.61GiB

My config.yaml (I tried several configurations; the result was the same):

trainingInput:
  scaleTier: CUSTOM
  masterType: standard_p100

The job submission:

gcloud ml-engine jobs submit training $JOB_NAME \
    --labels="$LABELS" \
    --verbosity='debug' \
    --stream-logs \
    --package-path=./job \
    --module-name=job.task \
    --staging-bucket="$TRAIN_BUCKET" \
    --region=us-central1 \
    --runtime-version 1.10 \
    --config=job/config.yaml

The full log:

INFO    2018-10-15 10:28:37 -0700   service     Validating job requirements...
INFO    2018-10-15 10:28:38 -0700   service     Job creation request has been successfully validated.
INFO    2018-10-15 10:28:38 -0700   service     Job <JOB_NAME> is queued.
INFO    2018-10-15 10:28:38 -0700   service     Waiting for job to be provisioned.
INFO    2018-10-15 10:28:41 -0700   service     Waiting for training program to start.
INFO    2018-10-15 10:30:03 -0700   master-replica-0        Running task with arguments: --cluster={"master": ["127.0.0.1:2222"]} --task={"type": "master", "index": 0} --job={  "scale_tier": "CUSTOM",  "master_type": "standard_p100",  "package_uris": ["gs://annotator-1286-ml/<JOB_NAME>/5b038627d10c914d6309269cefff8d2e0682f87f441bdb8c547a05e8ed1107a7/job-0.0.0.tar.gz"],  "python_module": "job.task",  "region": "us-central1",  "runtime_version": "1.10",  "run_on_raw_vm": true}
INFO    2018-10-15 10:30:15 -0700   master-replica-0        Running module job.task.
INFO    2018-10-15 10:30:15 -0700   master-replica-0        Downloading the package: gs://annotator-1286-ml/<JOB_NAME>/5b038627d10c914d6309269cefff8d2e0682f87f441bdb8c547a05e8ed1107a7/job-0.0.0.tar.gz
INFO    2018-10-15 10:30:15 -0700   master-replica-0        Running command: gsutil -q cp gs://annotator-1286-ml/<JOB_NAME>/5b038627d10c914d6309269cefff8d2e0682f87f441bdb8c547a05e8ed1107a7/job-0.0.0.tar.gz job-0.0.0.tar.gz
INFO    2018-10-15 10:30:22 -0700   master-replica-0        Installing the package: gs://annotator-1286-ml/<JOB_NAME>/5b038627d10c914d6309269cefff8d2e0682f87f441bdb8c547a05e8ed1107a7/job-0.0.0.tar.gz
INFO    2018-10-15 10:30:22 -0700   master-replica-0        Running command: pip install --user --upgrade --force-reinstall --no-deps job-0.0.0.tar.gz
INFO    2018-10-15 10:30:28 -0700   master-replica-0        Processing ./job-0.0.0.tar.gz
INFO    2018-10-15 10:30:29 -0700   master-replica-0        Building wheels for collected packages: job
INFO    2018-10-15 10:30:29 -0700   master-replica-0          Running setup.py bdist_wheel for job: started
INFO    2018-10-15 10:30:29 -0700   master-replica-0          Running setup.py bdist_wheel for job: finished with status 'done'
INFO    2018-10-15 10:30:29 -0700   master-replica-0          Stored in directory: /root/.cache/pip/wheels/b8/10/df/bb59eda2baac79b36fbdb8e5305ada7d6bf7779be49c3c5a0d
INFO    2018-10-15 10:30:29 -0700   master-replica-0        Successfully built job
INFO    2018-10-15 10:30:29 -0700   master-replica-0        Installing collected packages: job
INFO    2018-10-15 10:30:29 -0700   master-replica-0        Successfully installed job-0.0.0
INFO    2018-10-15 10:30:30 -0700   master-replica-0        Running command: pip install --user job-0.0.0.tar.gz
INFO    2018-10-15 10:30:30 -0700   master-replica-0        Processing ./job-0.0.0.tar.gz
INFO    2018-10-15 10:30:30 -0700   master-replica-0        Building wheels for collected packages: job
INFO    2018-10-15 10:30:30 -0700   master-replica-0          Running setup.py bdist_wheel for job: started
INFO    2018-10-15 10:30:30 -0700   master-replica-0          Running setup.py bdist_wheel for job: finished with status 'done'
INFO    2018-10-15 10:30:30 -0700   master-replica-0          Stored in directory: /root/.cache/pip/wheels/b8/10/df/bb59eda2baac79b36fbdb8e5305ada7d6bf7779be49c3c5a0d
INFO    2018-10-15 10:30:30 -0700   master-replica-0        Successfully built job
INFO    2018-10-15 10:30:31 -0700   master-replica-0        Installing collected packages: job
INFO    2018-10-15 10:30:31 -0700   master-replica-0          Found existing installation: job 0.0.0
INFO    2018-10-15 10:30:31 -0700   master-replica-0            Uninstalling job-0.0.0:
INFO    2018-10-15 10:30:31 -0700   master-replica-0              Successfully uninstalled job-0.0.0
INFO    2018-10-15 10:30:31 -0700   master-replica-0        Successfully installed job-0.0.0
INFO    2018-10-15 10:30:31 -0700   master-replica-0        Running command: python -m job.task
INFO    2018-10-15 10:31:02 -0700   master-replica-0        successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
INFO    2018-10-15 10:31:02 -0700   master-replica-0        Found device 0 with properties: 
ERROR   2018-10-15 10:31:02 -0700   master-replica-0        name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
ERROR   2018-10-15 10:31:02 -0700   master-replica-0        pciBusID: 0000:00:04.0
ERROR   2018-10-15 10:31:02 -0700   master-replica-0        totalMemory: 15.90GiB freeMemory: 15.61GiB
INFO    2018-10-15 10:31:02 -0700   master-replica-0        Adding visible gpu devices: 0
INFO    2018-10-15 10:31:03 -0700   master-replica-0        Device interconnect StreamExecutor with strength 1 edge matrix:
INFO    2018-10-15 10:31:03 -0700   master-replica-0             0 
INFO    2018-10-15 10:31:03 -0700   master-replica-0        0:   N 
INFO    2018-10-15 10:31:03 -0700   master-replica-0        Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15127 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0)
INFO    2018-10-15 10:32:06 -0700   master-replica-0        Mon Oct 15 17:32:06 2018       
INFO    2018-10-15 10:32:06 -0700   master-replica-0        +-----------------------------------------------------------------------------+
INFO    2018-10-15 10:32:06 -0700   master-replica-0        | NVIDIA-SMI 396.26                 Driver Version: 396.26                    |
INFO    2018-10-15 10:32:06 -0700   master-replica-0        |-------------------------------+----------------------+----------------------+
INFO    2018-10-15 10:32:06 -0700   master-replica-0        | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
INFO    2018-10-15 10:32:06 -0700   master-replica-0        | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
INFO    2018-10-15 10:32:06 -0700   master-replica-0        |===============================+======================+======================|
INFO    2018-10-15 10:32:06 -0700   master-replica-0        |   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
INFO    2018-10-15 10:32:06 -0700   master-replica-0        | N/A   46C    P0   172W / 250W |  15619MiB / 16280MiB |     82%      Default |
INFO    2018-10-15 10:32:06 -0700   master-replica-0        +-------------------------------+----------------------+----------------------+
INFO    2018-10-15 10:32:06 -0700   master-replica-0                                                                                       
INFO    2018-10-15 10:32:06 -0700   master-replica-0        +-----------------------------------------------------------------------------+
INFO    2018-10-15 10:32:06 -0700   master-replica-0        | Processes:                                                       GPU Memory |
INFO    2018-10-15 10:32:06 -0700   master-replica-0        |  GPU       PID   Type   Process name                             Usage      |
INFO    2018-10-15 10:32:06 -0700   master-replica-0        |=============================================================================|
INFO    2018-10-15 10:32:06 -0700   master-replica-0        +-----------------------------------------------------------------------------+
INFO    2018-10-15 10:37:06 -0700   master-replica-0        Mon Oct 15 17:37:06 2018       
INFO    2018-10-15 10:37:06 -0700   master-replica-0        +-----------------------------------------------------------------------------+
INFO    2018-10-15 10:37:06 -0700   master-replica-0        | NVIDIA-SMI 396.26                 Driver Version: 396.26                    |
INFO    2018-10-15 10:37:06 -0700   master-replica-0        |-------------------------------+----------------------+----------------------+
INFO    2018-10-15 10:37:06 -0700   master-replica-0        | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
INFO    2018-10-15 10:37:06 -0700   master-replica-0        | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
INFO    2018-10-15 10:37:06 -0700   master-replica-0        |===============================+======================+======================|
INFO    2018-10-15 10:37:06 -0700   master-replica-0        |   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
INFO    2018-10-15 10:37:06 -0700   master-replica-0        | N/A   52C    P0    39W / 250W |  15619MiB / 16280MiB |     33%      Default |
INFO    2018-10-15 10:37:06 -0700   master-replica-0        +-------------------------------+----------------------+----------------------+
INFO    2018-10-15 10:37:06 -0700   master-replica-0                                                                                       
INFO    2018-10-15 10:37:06 -0700   master-replica-0        +-----------------------------------------------------------------------------+
INFO    2018-10-15 10:37:06 -0700   master-replica-0        | Processes:                                                       GPU Memory |
INFO    2018-10-15 10:37:06 -0700   master-replica-0        |  GPU       PID   Type   Process name                             Usage      |
INFO    2018-10-15 10:37:06 -0700   master-replica-0        |=============================================================================|
INFO    2018-10-15 10:37:06 -0700   master-replica-0        +-----------------------------------------------------------------------------+
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Setting Parameters
INFO    2018-10-15 10:38:36 -0700   master-replica-0        get_personlab: Create data source
INFO    2018-10-15 10:38:36 -0700   master-replica-0        get_personlab: Parse tfrecords
INFO    2018-10-15 10:38:36 -0700   master-replica-0        get_personlab: Apply transformations
INFO    2018-10-15 10:38:36 -0700   master-replica-0        get_personlab: Parametrize Dataset
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Build Model
INFO    2018-10-15 10:38:36 -0700   master-replica-0        get_personlab: Define input sizes to Keras tensors and assign image tensor
INFO    2018-10-15 10:38:36 -0700   master-replica-0        get_personlab: Resnet
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("pool1/MaxPool:0", shape=(?, 99, 99, 64), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("res2a_relu/Relu:0", shape=(?, 99, 99, 256), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("res2b_relu/Relu:0", shape=(?, 99, 99, 256), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("res2c_relu/Relu:0", shape=(?, 99, 99, 256), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("res3a_relu/Relu:0", shape=(?, 50, 50, 512), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("res3b1_relu/Relu:0", shape=(?, 50, 50, 512), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("res3b2_relu/Relu:0", shape=(?, 50, 50, 512), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("res4a_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("res4b1_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("res4b2_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("res4b3_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("res4b4_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("res4b5_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("res4b6_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("res4b7_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("res4b8_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("res4b9_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("res4b10_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("res4b11_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("res4b12_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("res4b13_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("res4b14_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("res4b15_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("res4b16_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("res4b17_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("res4b18_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("res4b19_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("res4b20_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("res4b21_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("res4b22_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("res5a_relu/Relu:0", shape=(?, 25, 25, 2048), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("res5b_relu/Relu:0", shape=(?, 25, 25, 2048), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("res5c_relu/Relu:0", shape=(?, 25, 25, 2048), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        get_personlab: PersonLab Head
INFO    2018-10-15 10:38:36 -0700   master-replica-0        build_personlab_head: Add kp_maps
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("kp_maps/Sigmoid:0", shape=(?, 25, 25, 2048), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        build_personlab_head: Add short_offsets
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("short_offsets/BiasAdd:0", shape=(?, 25, 25, 2048), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        build_personlab_head: Add mid_offsets
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("mid_offsets/BiasAdd:0", shape=(?, 25, 25, 2048), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("kp_maps_tConv1/BiasAdd:0", shape=(?, 50, 50, 1048), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("kp_maps_tConv2/BiasAdd:0", shape=(?, 100, 100, 512), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("kp_maps_tConv3/BiasAdd:0", shape=(?, 200, 200, 256), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("kp_maps_tConv3_1/BiasAdd:0", shape=(?, 400, 400, 17), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("short_offsets_tConv1/BiasAdd:0", shape=(?, 50, 50, 1048), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("short_offsets_tConv2/BiasAdd:0", shape=(?, 100, 100, 512), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("short_offsets_tConv3/BiasAdd:0", shape=(?, 200, 200, 256), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("short_offsets_tConv3_1/BiasAdd:0", shape=(?, 400, 400, 34), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("mid_offsets_tConv1/BiasAdd:0", shape=(?, 50, 50, 1048), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("mid_offsets_tConv2/BiasAdd:0", shape=(?, 100, 100, 512), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("mid_offsets_tConv3/BiasAdd:0", shape=(?, 200, 200, 256), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Tensor("mid_offsets_tConv3_1/BiasAdd:0", shape=(?, 400, 400, 64), dtype=float32)
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Add loss and training operations
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Create Saver object
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Initialize variables
INFO    2018-10-15 10:38:36 -0700   master-replica-0        Training
INFO    2018-10-15 10:38:36 -0700   master-replica-0        ('iteration: ', '0')
INFO    2018-10-15 10:38:36 -0700   master-replica-0        ('iteration: ', '10')
INFO    2018-10-15 10:38:36 -0700   master-replica-0        ('iteration: ', '20')
INFO    2018-10-15 10:38:36 -0700   master-replica-0        ('iteration: ', '30')
INFO    2018-10-15 10:38:36 -0700   master-replica-0        ('iteration: ', '40')
INFO    2018-10-15 10:38:36 -0700   master-replica-0        ('iteration: ', '50')
INFO    2018-10-15 10:38:36 -0700   master-replica-0        ('iteration: ', '60')

[Screenshot: CPU and memory utilization charts]

How can I get more detail about this error?
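One standard way to dig deeper (these are general gcloud commands of that era, not something specific to this job; the Stackdriver filter is an assumption based on the `ml_job` monitored-resource type) is to query the job state and the raw logs directly:

```
# Inspect the job's final state and any error message recorded by the service
gcloud ml-engine jobs describe "$JOB_NAME"

# Re-stream the job's logs (same output as --stream-logs on submit)
gcloud ml-engine jobs stream-logs "$JOB_NAME"

# Query Stackdriver Logging directly for entries the streamed view may have dropped
gcloud logging read \
    'resource.type="ml_job" AND resource.labels.job_id="'"$JOB_NAME"'"' \
    --limit=50 --format=json
```

Note that the three `ERROR`-severity lines above are TensorFlow's GPU device-property output routed to stderr, so severity alone may not point at the real failure.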

Update: I am reading the data from Google Cloud Storage, and I followed https://www.tensorflow.org/performance/datasets_performance
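For reference, the core advice of that guide — parallelise the map step, then batch and prefetch so the accelerator never waits on input — can be sketched as follows. Synthetic in-memory data stands in for the poster's TFRecords on GCS, and `preprocess` is a hypothetical stand-in for the real parsing:

```python
import numpy as np
import tensorflow as tf

def preprocess(x):
    # Stand-in for the real TFRecord parsing / augmentation.
    return x * 2.0

data = np.arange(100, dtype=np.float32)
dataset = (
    tf.data.Dataset.from_tensor_slices(data)
    .map(preprocess, num_parallel_calls=4)  # overlap CPU preprocessing across cores
    .batch(8)
    .prefetch(1)  # produce batch N+1 while batch N is being consumed
)
```

The same `.map(..., num_parallel_calls=...)` / `.batch()` / `.prefetch()` chain applies unchanged to a `TFRecordDataset` reading from a gs:// path.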

1 Answer:

Answer 0 (score: 0)

I noticed that training was very slow, which gave me the impression that the problem was in the training itself. My guess is that distributed training has an issue because the layers are built with Keras.

I changed the configuration to use complex_model_l_gpu, and it worked.
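For reference, the working config.yaml swaps only the master machine type relative to the one in the question (`complex_model_l_gpu` was the legacy ML Engine tier backed by 8 NVIDIA K80 GPUs):

```yaml
trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_l_gpu
```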