I trained a CNN from scratch, a VGG_face Caffe model, and saved the training snapshots in HDF5 format; training apparently ran without problems up to iteration 3000.
solver.prototxt
net: "models/Custom_Model/train.prototxt"
# test_iter specifies how many forward passes the test should carry out
test_iter: 1
# Carry out testing every X training iterations
test_interval: 20
# Learning rate and momentum parameters for Adam
base_lr: 0.001
momentum: 0.9
momentum2: 0.999
# Adam takes care of changing the learning rate
lr_policy: "fixed"
# Display every X iterations
display: 10
# The maximum number of iterations
max_iter: 3000
# snapshot intermediate results
snapshot: 100
snapshot_prefix: "snapshots/"
snapshot_format: HDF5
type: "Adam"
# solver mode: CPU or GPU
solver_mode: CPU
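For reference, training with this solver can be launched from pycaffe roughly like this (a minimal sketch; the solver path is my assumption based on the net: path above):

import caffe

caffe.set_mode_cpu()  # matches solver_mode: CPU in the solver above

# get_solver reads type: "Adam" from the file and builds the matching solver;
# the path is assumed, adjust to wherever solver.prototxt actually lives.
solver = caffe.get_solver('models/Custom_Model/solver.prototxt')
solver.solve()  # runs to max_iter, writing an HDF5 snapshot every 100 iterations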
Something went wrong, though, because when I try to load the network to test classification, like this:
net = caffe.Net('models/my_face/deploy.prototxt', 'models/my_face/_iter_3000.solverstate.h5', caffe.TEST)
net.save('models/my_face/my_face.caffemodel')
I get the following error:
HDF5-DIAG: Error detected in HDF5 (1.8.11) thread 139907437496128:
#000: ../../../src/H5G.c line 463 in H5Gopen2(): unable to open group
major: Symbol table
minor: Can't open object
#001: ../../../src/H5Gint.c line 320 in H5G__open_name(): group not found
major: Symbol table
minor: Object not found
#002: ../../../src/H5Gloc.c line 430 in H5G_loc_find(): can't find object
major: Symbol table
minor: Object not found
#003: ../../../src/H5Gtraverse.c line 861 in H5G_traverse(): internal path traversal failed
major: Symbol table
minor: Object not found
#004: ../../../src/H5Gtraverse.c line 641 in H5G_traverse_real(): traversal operator failed
major: Symbol table
minor: Callback failed
#005: ../../../src/H5Gloc.c line 385 in H5G_loc_find_cb(): object 'data' doesn't exist
major: Symbol table
minor: Object not found
F0608 18:00:59.386113 154 net.cpp:802] Check failed: data_hid >= 0 (-1 vs. 0) Error reading weights from models/vitor_face/_iter_3000.solverstate.h5
*** Check failure stack trace: ***
Aborted
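The failing check is Caffe's HDF5 weight loader looking for a top-level group named 'data' in the file I passed. To see which groups the snapshot actually contains, the file can be walked with h5py (a minimal sketch):

import h5py

def show(name):
    print(name)  # one line per group/dataset inside the file

# A weights file is expected to carry a top-level 'data' group,
# which is exactly what the loader cannot find here.
with h5py.File('models/my_face/_iter_3000.solverstate.h5', 'r') as f:
    f.visit(show)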
deploy.prototxt
name: "VGG_FACE_16_layers"
input: "data"
input_dim: 1
input_dim: 3
input_dim: 224
input_dim: 224
layers {
name: "conv1"
type: CONVOLUTION
bottom: "data"
top: "conv1"
convolution_param {
num_output: 96
kernel_size: 7
stride: 2
}
}
layers {
name: "relu1"
type: RELU
bottom: "conv1"
top: "conv1"
}
layers {
name: "norm1"
type: LRN
bottom: "conv1"
top: "norm1"
lrn_param {
local_size: 5
alpha: 0.0005
beta: 0.75
}
}
layers {
name: "pool1"
type: POOLING
bottom: "norm1"
top: "pool1"
pooling_param {
pool: MAX
kernel_size: 3
stride: 3
}
}
layers {
name: "conv2"
type: CONVOLUTION
bottom: "pool1"
top: "conv2"
convolution_param {
num_output: 256
pad: 2
kernel_size: 5
}
}
layers {
name: "relu2"
type: RELU
bottom: "conv2"
top: "conv2"
}
layers {
name: "pool2"
type: POOLING
bottom: "conv2"
top: "pool2"
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
}
}
layers {
name: "conv3"
type: CONVOLUTION
bottom: "pool2"
top: "conv3"
convolution_param {
num_output: 512
pad: 1
kernel_size: 3
}
}
layers {
name: "relu3"
type: RELU
bottom: "conv3"
top: "conv3"
}
layers {
name: "conv4"
type: CONVOLUTION
bottom: "conv3"
top: "conv4"
convolution_param {
num_output: 512
pad: 1
kernel_size: 3
}
}
layers {
name: "relu4"
type: RELU
bottom: "conv4"
top: "conv4"
}
layers {
name: "conv5"
type: CONVOLUTION
bottom: "conv4"
top: "conv5"
convolution_param {
num_output: 512
pad: 1
kernel_size: 3
}
}
layers {
name: "relu5"
type: RELU
bottom: "conv5"
top: "conv5"
}
layers {
name: "pool5"
type: POOLING
bottom: "conv5"
top: "pool5"
pooling_param {
pool: MAX
kernel_size: 3
stride: 3
}
}
layers {
name: "fc6"
type: INNER_PRODUCT
bottom: "pool5"
top: "fc6"
inner_product_param {
num_output: 4048
}
}
layers {
name: "relu6"
type: RELU
bottom: "fc6"
top: "fc6"
}
layers {
name: "drop6"
type: DROPOUT
bottom: "fc6"
top: "fc6"
dropout_param {
dropout_ratio: 0.5
}
}
layers {
name: "fc7"
type: INNER_PRODUCT
bottom: "fc6"
top: "fc7"
inner_product_param {
num_output: 4048
}
}
layers {
name: "relu7"
type: RELU
bottom: "fc7"
top: "fc7"
}
layers {
name: "drop7"
type: DROPOUT
bottom: "fc7"
top: "fc7"
dropout_param {
dropout_ratio: 0.5
}
}
layers {
name: "fc8_cat"
type: INNER_PRODUCT
bottom: "fc7"
top: "fc8"
inner_product_param {
num_output: 6
}
}
layers {
name: "prob"
type: SOFTMAX
bottom: "fc8"
top: "prob"
}
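For context, the classification test I want to run looks roughly like this (a sketch; the image path and preprocessing values are placeholders, not my real settings):

import caffe

net = caffe.Net('models/my_face/deploy.prototxt',
                'models/my_face/my_face.caffemodel',
                caffe.TEST)

# Generic pycaffe preprocessing for a 224x224 BGR network.
transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2, 0, 1))     # HWC -> CHW
transformer.set_raw_scale('data', 255)           # [0,1] floats -> [0,255]
transformer.set_channel_swap('data', (2, 1, 0))  # RGB -> BGR

image = caffe.io.load_image('some_face.jpg')     # placeholder path
net.blobs['data'].data[...] = transformer.preprocess('data', image)
out = net.forward()
print(out['prob'].argmax())  # index of the most likely of the 6 classes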
To debug the problem, I turned to some net_surgery, and this is what my architecture prints out:
blobs ['data', 'conv1', 'norm1', 'pool1', 'conv2', 'pool2', 'conv3', 'conv4', 'conv5', 'pool5', 'fc6', 'fc7', 'fc8', 'prob']
params ['conv1', 'conv2', 'conv3', 'conv4', 'conv5', 'fc6', 'fc7', 'fc8_cat']
('POOL 5', (512, 7, 7))
Layer Name : conv1, Weight Dims :(96, 3, 7, 7)
Layer Name : conv2, Weight Dims :(256, 96, 5, 5)
Layer Name : conv3, Weight Dims :(512, 256, 3, 3)
Layer Name : conv4, Weight Dims :(512, 512, 3, 3)
Layer Name : conv5, Weight Dims :(512, 512, 3, 3)
Layer Name : fc6, Weight Dims :(4048, 25088)
Layer Name : fc7, Weight Dims :(4048, 4048)
Layer Name : fc8_cat, Weight Dims :(6, 4048)
fc6 weights are (4048, 25088) dimensional and biases are (4048,) dimensional
fc7 weights are (4048, 4048) dimensional and biases are (4048,) dimensional
fc8_cat weights are (6, 4048) dimensional and biases are (6,) dimensional
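(That listing comes from an inspection loop over net.blobs and net.params, roughly like the sketch below; the prototxt path is an assumption.)

import caffe

net = caffe.Net('models/my_face/deploy.prototxt', caffe.TEST)

print('blobs', list(net.blobs.keys()))
print('params', list(net.params.keys()))
print('POOL 5', net.blobs['pool5'].data.shape[1:])  # (channels, height, width)

# Blob 0 of each param is the weight matrix, blob 1 the bias vector.
for name, param in net.params.items():
    print('Layer Name : {}, Weight Dims : {}'.format(name, param[0].data.shape))

for name in ('fc6', 'fc7', 'fc8_cat'):
    w, b = net.params[name]
    print('{} weights are {} dimensional and biases are {} dimensional'.format(
        name, w.data.shape, b.data.shape))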
train.prototxt
name: "CaffeNet"
layers {
name: "training_train"
type: DATA
data_param {
source: "datasets/training_set_lmdb"
backend: LMDB
batch_size: 10
}
transform_param{
mean_file: "datasets/mean_training_image.binaryproto"
}
top: "data"
top: "label"
include {
phase: TRAIN
}
}
layers {
name: "training_test"
type: DATA
data_param {
source: "datasets/validation_set_lmdb"
backend: LMDB
batch_size: 1
}
transform_param{
mean_file: "datasets/mean_training_image.binaryproto"
}
top: "data"
top: "label"
include {
phase: TEST
}
}
layers {
name: "conv1"
type: CONVOLUTION
bottom: "data"
top: "conv1"
convolution_param {
num_output: 96
kernel_size: 7
stride: 2
}
blobs_lr: 0
blobs_lr: 0
}
layers {
name: "relu1"
type: RELU
bottom: "conv1"
top: "conv1"
}
layers {
name: "norm1"
type: LRN
bottom: "conv1"
top: "norm1"
lrn_param {
local_size: 5
alpha: 0.0005
beta: 0.75
}
}
layers {
name: "pool1"
type: POOLING
bottom: "norm1"
top: "pool1"
pooling_param {
pool: MAX
kernel_size: 3
stride: 3
}
}
layers {
name: "conv2"
type: CONVOLUTION
bottom: "pool1"
top: "conv2"
convolution_param {
num_output: 256
pad: 2
kernel_size: 5
}
blobs_lr: 0
blobs_lr: 0
}
layers {
name: "relu2"
type: RELU
bottom: "conv2"
top: "conv2"
}
layers {
name: "pool2"
type: POOLING
bottom: "conv2"
top: "pool2"
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
}
}
layers {
name: "conv3"
type: CONVOLUTION
bottom: "pool2"
top: "conv3"
convolution_param {
num_output: 512
pad: 1
kernel_size: 3
}
blobs_lr: 0
blobs_lr: 0
}
layers {
name: "relu3"
type: RELU
bottom: "conv3"
top: "conv3"
}
layers {
name: "conv4"
type: CONVOLUTION
bottom: "conv3"
top: "conv4"
convolution_param {
num_output: 512
pad: 1
kernel_size: 3
}
blobs_lr: 0
blobs_lr: 0
}
layers {
name: "relu4"
type: RELU
bottom: "conv4"
top: "conv4"
}
layers {
name: "conv5"
type: CONVOLUTION
bottom: "conv4"
top: "conv5"
convolution_param {
num_output: 512
pad: 1
kernel_size: 3
}
blobs_lr: 0
blobs_lr: 0
}
layers {
name: "relu5"
type: RELU
bottom: "conv5"
top: "conv5"
}
layers {
name: "pool5"
type: POOLING
bottom: "conv5"
top: "pool5"
pooling_param {
pool: MAX
kernel_size: 3
stride: 3
}
}
layers {
name: "fc6"
type: INNER_PRODUCT
bottom: "pool5"
top: "fc6"
inner_product_param {
num_output: 4048
}
blobs_lr: 1.0
blobs_lr: 1.0
}
layers {
name: "relu6"
type: RELU
bottom: "fc6"
top: "fc6"
}
layers {
name: "drop6"
type: DROPOUT
bottom: "fc6"
top: "fc6"
dropout_param {
dropout_ratio: 0.5
}
}
layers {
name: "fc7"
type: INNER_PRODUCT
bottom: "fc6"
top: "fc7"
inner_product_param {
num_output: 4048
}
blobs_lr: 1.0
blobs_lr: 1.0
}
layers {
name: "relu7"
type: RELU
bottom: "fc7"
top: "fc7"
}
layers {
name: "drop7"
type: DROPOUT
bottom: "fc7"
top: "fc7"
dropout_param {
dropout_ratio: 0.5
}
}
layers {
name: "fc8_cat"
type: INNER_PRODUCT
bottom: "fc7"
top: "fc8"
inner_product_param {
num_output: 6
}
blobs_lr: 1.0
blobs_lr: 1.0
}
layers {
name: "prob"
type: SOFTMAX_LOSS
bottom: "fc8"
bottom: "label"
}
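In case the problem is a train/deploy architecture mismatch, the two prototxts can also be diffed structurally without instantiating either net (a sketch using the protobuf text format; the paths are assumptions):

from caffe.proto import caffe_pb2
from google.protobuf import text_format

def load_net(path):
    net_param = caffe_pb2.NetParameter()
    with open(path) as f:
        text_format.Merge(f.read(), net_param)
    return net_param

train_net = load_net('models/Custom_Model/train.prototxt')
deploy_net = load_net('models/my_face/deploy.prototxt')

# Both files use the old V1 'layers' field; compare layers by name.
train_names = set(l.name for l in train_net.layers)
deploy_names = set(l.name for l in deploy_net.layers)
print('only in train:', sorted(train_names - deploy_names))
print('only in deploy:', sorted(deploy_names - train_names))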
How can I fix this?