MobilenetSSDv2 freezes during transfer learning

Date: 2019-10-02 18:49:21

Tags: python tensorflow object-detection-api

I am training a model with Mobilenet-SSD-v2. It trains for a while, then tries to run evaluation and freezes.

I am running tensorflow-gpu 1.14 inside the tensorflow/tensorflow:latest-gpu Docker image, on Ubuntu 19.04 with an RTX 2060. I am using the latest Object Detection API from this git repo: https://github.com/tensorflow/models

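I tried setting throttle_secs in model_lib.py, and it made no difference. Training itself still works, but every time the job tries to evaluate, I have to restart the Docker container.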

I am only using the code provided by the git repo. I start training with the command below.

PIPELINE_CONFIG_PATH=/tensorflow/models/research/face/pipeline.config
MODEL_DIR=/tensorflow/models/research/face/training/
NUM_TRAIN_STEPS=50000
SAMPLE_1_OF_N_EVAL_EXAMPLES=1
python object_detection/model_main.py \
    --pipeline_config_path=${PIPELINE_CONFIG_PATH} \
    --model_dir=${MODEL_DIR} \
    --num_train_steps=${NUM_TRAIN_STEPS} \
    --sample_1_of_n_eval_examples=$SAMPLE_1_OF_N_EVAL_EXAMPLES \
    --alsologtostderr

I expected it to keep training. Instead, it just hangs and I have to restart. This is the last output before it hangs:

I1002 18:28:30.106040 139663203059520 evaluation.py:255] Starting evaluation at 2019-10-02T18:28:30Z
I1002 18:28:30.717183 139663203059520 monitored_session.py:240] Graph was finalized.
2019-10-02 18:28:30.717937: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-02 18:28:30.718182: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: GeForce RTX 2060 major: 7 minor: 5 memoryClockRate(GHz): 1.71
pciBusID: 0000:0a:00.0
2019-10-02 18:28:30.718232: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-10-02 18:28:30.718251: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-10-02 18:28:30.718263: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-10-02 18:28:30.718279: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-10-02 18:28:30.718295: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-10-02 18:28:30.718309: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-10-02 18:28:30.718326: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-10-02 18:28:30.718401: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-02 18:28:30.718655: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-02 18:28:30.718861: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-10-02 18:28:30.718888: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-10-02 18:28:30.718898: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 
2019-10-02 18:28:30.718907: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N 
2019-10-02 18:28:30.718992: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-02 18:28:30.719242: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-02 18:28:30.719460: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4946 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2060, pci bus id: 0000:0a:00.0, compute capability: 7.5)
I1002 18:28:30.720419 139663203059520 saver.py:1280] Restoring parameters from /tensorflow/models/research/face/training/model.ckpt-10756
I1002 18:28:32.285661 139663203059520 session_manager.py:500] Running local_init_op.
I1002 18:28:32.408489 139663203059520 session_manager.py:502] Done running local_init_op.
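
A small sketch, not part of the original post, for finding the last Python-side message in a saved copy of such a log. It assumes the absl/glog line prefix seen above (severity letter, MMDD, time, thread id, file:line) and skips the C++ runtime lines that start with a full date. The function name and the sample text are illustrative only.

```python
import re

# Matches lines such as:
# I1002 18:28:32.408489 139663203059520 session_manager.py:502] Done running local_init_op.
GLOG_LINE = re.compile(
    r"^(?P<severity>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+(?P<thread>\d+) "
    r"(?P<source>[^\s:]+):(?P<line>\d+)\] (?P<message>.*)$"
)


def last_python_log_record(log_text):
    """Return the fields of the last glog-style line in log_text, or None."""
    last = None
    for raw in log_text.splitlines():
        match = GLOG_LINE.match(raw.strip())
        if match:
            last = match.groupdict()
    return last


# A few lines copied from the log in the question.
SAMPLE_LOG = """\
I1002 18:28:30.106040 139663203059520 evaluation.py:255] Starting evaluation at 2019-10-02T18:28:30Z
2019-10-02 18:28:30.718861: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
I1002 18:28:32.285661 139663203059520 session_manager.py:500] Running local_init_op.
I1002 18:28:32.408489 139663203059520 session_manager.py:502] Done running local_init_op.
"""

if __name__ == "__main__":
    record = last_python_log_record(SAMPLE_LOG)
    if record is not None:
        print("{source}:{line} at {time}: {message}".format(**record))
```

On the log above, the last such record is the one from session_manager.py, so the hang happens after the evaluation session has been initialized.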

1 answer:

Answer 0: (score: 0)

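I ran into the same problem about 6-7 months ago and could not find a solution. What I did instead was create a new environment from scratch. The details of my working environment are listed below.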

# Name                    Version                   Build  Channel
absl-py                   0.8.0                    pypi_0    pypi
astor                     0.8.0                    pypi_0    pypi
bleach                    1.5.0                    pypi_0    pypi
certifi                   2018.8.24                py35_1    anaconda
contextlib2               0.5.5                    pypi_0    pypi
cycler                    0.10.0                   pypi_0    pypi
cython                    0.29.13                  pypi_0    pypi
gast                      0.3.2                    pypi_0    pypi
grpcio                    1.23.0                   pypi_0    pypi
html5lib                  0.9999999                pypi_0    pypi
kiwisolver                1.1.0                    pypi_0    pypi
libprotobuf               3.6.0                h1a1b453_0    anaconda
lxml                      4.4.1                    pypi_0    pypi
markdown                  3.1.1                    pypi_0    pypi
matplotlib                3.0.3                    pypi_0    pypi
numpy                     1.17.2                   pypi_0    pypi
opencv-python             4.1.1.26                 pypi_0    pypi
pandas                    0.25.1                   pypi_0    pypi
pillow                    6.1.0                    pypi_0    pypi
pip                       19.2.3                   pypi_0    pypi
protobuf                  3.9.1                    pypi_0    pypi
pyparsing                 2.4.2                    pypi_0    pypi
python                    3.5.6                he025d50_0
python-dateutil           2.8.0                    pypi_0    pypi
pytz                      2019.2                   pypi_0    pypi
setuptools                41.2.0                   pypi_0    pypi
six                       1.12.0                   pypi_0    pypi
tensorboard               1.8.0                    pypi_0    pypi
tensorflow-gpu            1.8.0                    pypi_0    pypi
termcolor                 1.1.0                    pypi_0    pypi
vc                        14.1                 h21ff451_3    anaconda
vs2015_runtime            15.5.2                        3    anaconda
werkzeug                  0.15.6                   pypi_0    pypi
wheel                     0.33.6                   pypi_0    pypi
wincertstore              0.2              py35hfebbdb8_0
zlib                      1.2.11               h62dcd97_3    anaconda

Notes:

As you can see, my Python version is 3.5; this is my home PC. My work PC has exactly the same packages with Python 3.6.8, so this setup also works on 3.6.

Also, I believe tensorflow/models somehow works better with earlier versions of TensorFlow. As you can see, my version is 1.8.0. When I ran into the same problem, I was using 1.13.
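
To compare another environment against the listing above, something like the following sketch may help. It is not part of my original setup: the function names are made up for illustration, WORKING_ENV holds only a few rows copied from the listing, and the "current" environment in the demo is a hypothetical one that uses the tensorflow-gpu 1.14 from the question.

```python
def parse_conda_list(text):
    """Parse `conda list`-style text into a {package_name: version} dict."""
    versions = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# Name  Version ..." header.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) >= 2:
            versions[fields[0]] = fields[1]
    return versions


def diff_versions(expected, actual):
    """Return {name: (expected_version, actual_version_or_None)} for mismatches."""
    return {
        name: (want, actual.get(name))
        for name, want in expected.items()
        if actual.get(name) != want
    }


# A few rows copied from the working environment listed above.
WORKING_ENV = """\
# Name                    Version                   Build  Channel
numpy                     1.17.2                   pypi_0    pypi
protobuf                  3.9.1                    pypi_0    pypi
tensorboard               1.8.0                    pypi_0    pypi
tensorflow-gpu            1.8.0                    pypi_0    pypi
"""

if __name__ == "__main__":
    expected = parse_conda_list(WORKING_ENV)
    # Hypothetical environment for the demo; in practice, parse the output
    # of `conda list` from the environment that hangs.
    current = dict(expected)
    current["tensorflow-gpu"] = "1.14.0"
    for name, (want, have) in sorted(diff_versions(expected, current).items()):
        print("{}: working env has {}, this env has {}".format(name, want, have))
```

A mismatch reported this way only shows where two environments differ. It does not prove that the differing package is what causes the hang.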

I hope this solves it.