I cannot get training to work on ML Engine. Training always stops around iteration 60. I build the model layers with Keras, but I run the training with tf.Session.
I get this error, but with no traceback.
ERROR 2018-10-15 10:31:02 -0700 master-replica-0 name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
ERROR 2018-10-15 10:31:02 -0700 master-replica-0 pciBusID: 0000:00:04.0
ERROR 2018-10-15 10:31:02 -0700 master-replica-0 totalMemory: 15.90GiB freeMemory: 15.61GiB
My config.yaml. I have tried different configurations; the result is the same.
trainingInput:
scaleTier: CUSTOM
masterType: standard_p100
Job submission:
gcloud ml-engine jobs submit training $JOB_NAME --labels="$LABELS" --verbosity='debug' --stream-logs --package-path=./job --module-name=job.task --staging-bucket="$TRAIN_BUCKET" --region=us-central1 --runtime-version 1.10 --config=job/config.yaml
Full log:
INFO 2018-10-15 10:28:37 -0700 service Validating job requirements...
INFO 2018-10-15 10:28:38 -0700 service Job creation request has been successfully validated.
INFO 2018-10-15 10:28:38 -0700 service Job <JOB_NAME> is queued.
INFO 2018-10-15 10:28:38 -0700 service Waiting for job to be provisioned.
INFO 2018-10-15 10:28:41 -0700 service Waiting for training program to start.
INFO 2018-10-15 10:30:03 -0700 master-replica-0 Running task with arguments: --cluster={"master": ["127.0.0.1:2222"]} --task={"type": "master", "index": 0} --job={ "scale_tier": "CUSTOM", "master_type": "standard_p100", "package_uris": ["gs://annotator-1286-ml/<JOB_NAME>/5b038627d10c914d6309269cefff8d2e0682f87f441bdb8c547a05e8ed1107a7/job-0.0.0.tar.gz"], "python_module": "job.task", "region": "us-central1", "runtime_version": "1.10", "run_on_raw_vm": true}
INFO 2018-10-15 10:30:15 -0700 master-replica-0 Running module job.task.
INFO 2018-10-15 10:30:15 -0700 master-replica-0 Downloading the package: gs://annotator-1286-ml/<JOB_NAME>/5b038627d10c914d6309269cefff8d2e0682f87f441bdb8c547a05e8ed1107a7/job-0.0.0.tar.gz
INFO 2018-10-15 10:30:15 -0700 master-replica-0 Running command: gsutil -q cp gs://annotator-1286-ml/<JOB_NAME>/5b038627d10c914d6309269cefff8d2e0682f87f441bdb8c547a05e8ed1107a7/job-0.0.0.tar.gz job-0.0.0.tar.gz
INFO 2018-10-15 10:30:22 -0700 master-replica-0 Installing the package: gs://annotator-1286-ml/<JOB_NAME>/5b038627d10c914d6309269cefff8d2e0682f87f441bdb8c547a05e8ed1107a7/job-0.0.0.tar.gz
INFO 2018-10-15 10:30:22 -0700 master-replica-0 Running command: pip install --user --upgrade --force-reinstall --no-deps job-0.0.0.tar.gz
INFO 2018-10-15 10:30:28 -0700 master-replica-0 Processing ./job-0.0.0.tar.gz
INFO 2018-10-15 10:30:29 -0700 master-replica-0 Building wheels for collected packages: job
INFO 2018-10-15 10:30:29 -0700 master-replica-0 Running setup.py bdist_wheel for job: started
INFO 2018-10-15 10:30:29 -0700 master-replica-0 Running setup.py bdist_wheel for job: finished with status 'done'
INFO 2018-10-15 10:30:29 -0700 master-replica-0 Stored in directory: /root/.cache/pip/wheels/b8/10/df/bb59eda2baac79b36fbdb8e5305ada7d6bf7779be49c3c5a0d
INFO 2018-10-15 10:30:29 -0700 master-replica-0 Successfully built job
INFO 2018-10-15 10:30:29 -0700 master-replica-0 Installing collected packages: job
INFO 2018-10-15 10:30:29 -0700 master-replica-0 Successfully installed job-0.0.0
INFO 2018-10-15 10:30:30 -0700 master-replica-0 Running command: pip install --user job-0.0.0.tar.gz
INFO 2018-10-15 10:30:30 -0700 master-replica-0 Processing ./job-0.0.0.tar.gz
INFO 2018-10-15 10:30:30 -0700 master-replica-0 Building wheels for collected packages: job
INFO 2018-10-15 10:30:30 -0700 master-replica-0 Running setup.py bdist_wheel for job: started
INFO 2018-10-15 10:30:30 -0700 master-replica-0 Running setup.py bdist_wheel for job: finished with status 'done'
INFO 2018-10-15 10:30:30 -0700 master-replica-0 Stored in directory: /root/.cache/pip/wheels/b8/10/df/bb59eda2baac79b36fbdb8e5305ada7d6bf7779be49c3c5a0d
INFO 2018-10-15 10:30:30 -0700 master-replica-0 Successfully built job
INFO 2018-10-15 10:30:31 -0700 master-replica-0 Installing collected packages: job
INFO 2018-10-15 10:30:31 -0700 master-replica-0 Found existing installation: job 0.0.0
INFO 2018-10-15 10:30:31 -0700 master-replica-0 Uninstalling job-0.0.0:
INFO 2018-10-15 10:30:31 -0700 master-replica-0 Successfully uninstalled job-0.0.0
INFO 2018-10-15 10:30:31 -0700 master-replica-0 Successfully installed job-0.0.0
INFO 2018-10-15 10:30:31 -0700 master-replica-0 Running command: python -m job.task
INFO 2018-10-15 10:31:02 -0700 master-replica-0 successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
INFO 2018-10-15 10:31:02 -0700 master-replica-0 Found device 0 with properties:
ERROR 2018-10-15 10:31:02 -0700 master-replica-0 name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
ERROR 2018-10-15 10:31:02 -0700 master-replica-0 pciBusID: 0000:00:04.0
ERROR 2018-10-15 10:31:02 -0700 master-replica-0 totalMemory: 15.90GiB freeMemory: 15.61GiB
INFO 2018-10-15 10:31:02 -0700 master-replica-0 Adding visible gpu devices: 0
INFO 2018-10-15 10:31:03 -0700 master-replica-0 Device interconnect StreamExecutor with strength 1 edge matrix:
INFO 2018-10-15 10:31:03 -0700 master-replica-0 0
INFO 2018-10-15 10:31:03 -0700 master-replica-0 0: N
INFO 2018-10-15 10:31:03 -0700 master-replica-0 Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15127 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0)
INFO 2018-10-15 10:32:06 -0700 master-replica-0 Mon Oct 15 17:32:06 2018
INFO 2018-10-15 10:32:06 -0700 master-replica-0 +-----------------------------------------------------------------------------+
INFO 2018-10-15 10:32:06 -0700 master-replica-0 | NVIDIA-SMI 396.26 Driver Version: 396.26 |
INFO 2018-10-15 10:32:06 -0700 master-replica-0 |-------------------------------+----------------------+----------------------+
INFO 2018-10-15 10:32:06 -0700 master-replica-0 | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
INFO 2018-10-15 10:32:06 -0700 master-replica-0 | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
INFO 2018-10-15 10:32:06 -0700 master-replica-0 |===============================+======================+======================|
INFO 2018-10-15 10:32:06 -0700 master-replica-0 | 0 Tesla P100-PCIE... Off | 00000000:00:04.0 Off | 0 |
INFO 2018-10-15 10:32:06 -0700 master-replica-0 | N/A 46C P0 172W / 250W | 15619MiB / 16280MiB | 82% Default |
INFO 2018-10-15 10:32:06 -0700 master-replica-0 +-------------------------------+----------------------+----------------------+
INFO 2018-10-15 10:32:06 -0700 master-replica-0
INFO 2018-10-15 10:32:06 -0700 master-replica-0 +-----------------------------------------------------------------------------+
INFO 2018-10-15 10:32:06 -0700 master-replica-0 | Processes: GPU Memory |
INFO 2018-10-15 10:32:06 -0700 master-replica-0 | GPU PID Type Process name Usage |
INFO 2018-10-15 10:32:06 -0700 master-replica-0 |=============================================================================|
INFO 2018-10-15 10:32:06 -0700 master-replica-0 +-----------------------------------------------------------------------------+
INFO 2018-10-15 10:37:06 -0700 master-replica-0 Mon Oct 15 17:37:06 2018
INFO 2018-10-15 10:37:06 -0700 master-replica-0 +-----------------------------------------------------------------------------+
INFO 2018-10-15 10:37:06 -0700 master-replica-0 | NVIDIA-SMI 396.26 Driver Version: 396.26 |
INFO 2018-10-15 10:37:06 -0700 master-replica-0 |-------------------------------+----------------------+----------------------+
INFO 2018-10-15 10:37:06 -0700 master-replica-0 | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
INFO 2018-10-15 10:37:06 -0700 master-replica-0 | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
INFO 2018-10-15 10:37:06 -0700 master-replica-0 |===============================+======================+======================|
INFO 2018-10-15 10:37:06 -0700 master-replica-0 | 0 Tesla P100-PCIE... Off | 00000000:00:04.0 Off | 0 |
INFO 2018-10-15 10:37:06 -0700 master-replica-0 | N/A 52C P0 39W / 250W | 15619MiB / 16280MiB | 33% Default |
INFO 2018-10-15 10:37:06 -0700 master-replica-0 +-------------------------------+----------------------+----------------------+
INFO 2018-10-15 10:37:06 -0700 master-replica-0
INFO 2018-10-15 10:37:06 -0700 master-replica-0 +-----------------------------------------------------------------------------+
INFO 2018-10-15 10:37:06 -0700 master-replica-0 | Processes: GPU Memory |
INFO 2018-10-15 10:37:06 -0700 master-replica-0 | GPU PID Type Process name Usage |
INFO 2018-10-15 10:37:06 -0700 master-replica-0 |=============================================================================|
INFO 2018-10-15 10:37:06 -0700 master-replica-0 +-----------------------------------------------------------------------------+
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Setting Parameters
INFO 2018-10-15 10:38:36 -0700 master-replica-0 get_personlab: Create data source
INFO 2018-10-15 10:38:36 -0700 master-replica-0 get_personlab: Parse tfrecords
INFO 2018-10-15 10:38:36 -0700 master-replica-0 get_personlab: Apply transformations
INFO 2018-10-15 10:38:36 -0700 master-replica-0 get_personlab: Parametrize Dataset
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Build Model
INFO 2018-10-15 10:38:36 -0700 master-replica-0 get_personlab: Define input sizes to Keras tensors and assign image tensor
INFO 2018-10-15 10:38:36 -0700 master-replica-0 get_personlab: Resnet
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("pool1/MaxPool:0", shape=(?, 99, 99, 64), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("res2a_relu/Relu:0", shape=(?, 99, 99, 256), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("res2b_relu/Relu:0", shape=(?, 99, 99, 256), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("res2c_relu/Relu:0", shape=(?, 99, 99, 256), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("res3a_relu/Relu:0", shape=(?, 50, 50, 512), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("res3b1_relu/Relu:0", shape=(?, 50, 50, 512), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("res3b2_relu/Relu:0", shape=(?, 50, 50, 512), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("res4a_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("res4b1_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("res4b2_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("res4b3_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("res4b4_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("res4b5_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("res4b6_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("res4b7_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("res4b8_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("res4b9_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("res4b10_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("res4b11_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("res4b12_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("res4b13_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("res4b14_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("res4b15_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("res4b16_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("res4b17_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("res4b18_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("res4b19_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("res4b20_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("res4b21_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("res4b22_relu/Relu:0", shape=(?, 25, 25, 1024), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("res5a_relu/Relu:0", shape=(?, 25, 25, 2048), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("res5b_relu/Relu:0", shape=(?, 25, 25, 2048), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("res5c_relu/Relu:0", shape=(?, 25, 25, 2048), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 get_personlab: PersonLab Head
INFO 2018-10-15 10:38:36 -0700 master-replica-0 build_personlab_head: Add kp_maps
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("kp_maps/Sigmoid:0", shape=(?, 25, 25, 2048), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 build_personlab_head: Add short_offsets
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("short_offsets/BiasAdd:0", shape=(?, 25, 25, 2048), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 build_personlab_head: Add mid_offsets
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("mid_offsets/BiasAdd:0", shape=(?, 25, 25, 2048), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("kp_maps_tConv1/BiasAdd:0", shape=(?, 50, 50, 1048), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("kp_maps_tConv2/BiasAdd:0", shape=(?, 100, 100, 512), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("kp_maps_tConv3/BiasAdd:0", shape=(?, 200, 200, 256), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("kp_maps_tConv3_1/BiasAdd:0", shape=(?, 400, 400, 17), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("short_offsets_tConv1/BiasAdd:0", shape=(?, 50, 50, 1048), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("short_offsets_tConv2/BiasAdd:0", shape=(?, 100, 100, 512), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("short_offsets_tConv3/BiasAdd:0", shape=(?, 200, 200, 256), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("short_offsets_tConv3_1/BiasAdd:0", shape=(?, 400, 400, 34), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("mid_offsets_tConv1/BiasAdd:0", shape=(?, 50, 50, 1048), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("mid_offsets_tConv2/BiasAdd:0", shape=(?, 100, 100, 512), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("mid_offsets_tConv3/BiasAdd:0", shape=(?, 200, 200, 256), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Tensor("mid_offsets_tConv3_1/BiasAdd:0", shape=(?, 400, 400, 64), dtype=float32)
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Add loss and training operations
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Create Saver object
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Initialize variables
INFO 2018-10-15 10:38:36 -0700 master-replica-0 Training
INFO 2018-10-15 10:38:36 -0700 master-replica-0 ('iteration: ', '0')
INFO 2018-10-15 10:38:36 -0700 master-replica-0 ('iteration: ', '10')
INFO 2018-10-15 10:38:36 -0700 master-replica-0 ('iteration: ', '20')
INFO 2018-10-15 10:38:36 -0700 master-replica-0 ('iteration: ', '30')
INFO 2018-10-15 10:38:36 -0700 master-replica-0 ('iteration: ', '40')
INFO 2018-10-15 10:38:36 -0700 master-replica-0 ('iteration: ', '50')
INFO 2018-10-15 10:38:36 -0700 master-replica-0 ('iteration: ', '60')
How can I get more details about this error?
Update: I am reading the data from Google Cloud Storage, following https://www.tensorflow.org/performance/datasets_performance.
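One way to surface more detail is to make the job's own logging unbuffered and to catch exceptions around each training step before the worker dies. This is a minimal sketch, not the actual job code: `train_step` is a hypothetical stand-in for the `session.run` call inside `job.task`.

```python
import logging
import sys

# ML Engine forwards stdout/stderr to the job log, but buffered prints can
# be lost when a worker exits abruptly. Logging to stderr at INFO level and
# logging the traceback explicitly makes the real failure visible.
logging.basicConfig(stream=sys.stderr, level=logging.INFO)
log = logging.getLogger("job.task")

def run_training(train_step, num_iterations):
    """Run train_step(i) for each iteration, logging any failure with a traceback."""
    for i in range(num_iterations):
        try:
            train_step(i)
        except Exception:
            # log.exception records the full traceback before re-raising,
            # so it appears in the job log even if the process then dies.
            log.exception("training failed at iteration %d", i)
            raise
        if i % 10 == 0:
            log.info("iteration: %d", i)
```

With this wrapper, a crash at iteration ~60 would at least leave a Python traceback in the log instead of silence.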
Answer 0 (score: 0):
I noticed it was extremely slow, which gave me the impression the problem was with the training itself. My guess is that the Keras layering caused a problem with distributed training. I changed the configuration to use complex_model_l_gpu and it worked.
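For reference, assuming the same CUSTOM scale tier as the question's config.yaml, the only change is the master machine type:

```yaml
trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_l_gpu
```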