I have a server with two Nvidia GTX 1080s (driver 384.111). Currently, I train two larger CNN models in parallel, one per GPU (TensorFlow backend), by adjusting the TF config and passing the session to Keras:
import tensorflow as tf
from keras.backend.tensorflow_backend import set_session

config = tf.ConfigProto()
config.gpu_options.visible_device_list = "0"  # "1" in the second process, so each process sees only its own GPU
set_session(tf.Session(config=config))
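(For reference, an equivalent way to pin each training process to one GPU is to set CUDA_VISIBLE_DEVICES before TensorFlow is imported; a minimal sketch:)

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # '1' in the second process; set before importing tensorflow
import tensorflow as tf
from keras.backend.tensorflow_backend import set_session
set_session(tf.Session(config=tf.ConfigProto()))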
I noticed that one GPU (let's call it GPU A) stays between 95% and 100% utilization, which is normal. On some days(!), however, the other GPU (B) fluctuates heavily between 0% and 100% utilization, which makes model training slow (I made sure the GPU is actually being used, so I am not accidentally running on the CPU). Power draw also seems rather low and fluctuates along with the utilization. Memory is allocated correctly on both GPUs. The busy GPU sits at 80°C, the other at 90°C (the cards are mounted suboptimally, so GPU A also heats up GPU B a little, hence the temperature difference).
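The utilization, power and temperature numbers above come from sampling both cards once per second; a minimal sketch of that, assuming a standard nvidia-smi install:

import subprocess

# Print index, utilization, power draw and temperature of both GPUs every second
subprocess.run(['nvidia-smi',
                '--query-gpu=index,utilization.gpu,power.draw,temperature.gpu',
                '--format=csv', '-l', '1'])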
What could be causing this? TensorFlow? The driver? The PSU (the current one is rated >1150 W)? Temperature? A faulty GPU?
--- UPDATE 1 ---
The fluctuating performance does not occur when using only 1 GPU.
Also, the phenomenon has recently also occurred the other way around: GPU A performing poorly while GPU B was doing fine.
Also, I don't think it is a temperature issue, because it also happens when both cards are cold (just tested). So I suppose it comes down to (a) TensorFlow, (b) the driver, or (c) the PSU?
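One more thing I can test myself is whether the input pipeline starves the GPU: if the generator cannot produce batches as fast as the GPU consumes them, utilization would drop exactly like this. A minimal timing sketch, using the train_gen_new generator from the code in Update 2 below:

from time import time

t0 = time()
for _ in range(50):
    next(train_gen_new)  # pull 50 batches from the augmentation pipeline
print('avg seconds per batch:', (time() - t0) / 50)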
--- UPDATE 2 ---
Here are the functions that perform the model assembly (1) as well as the batching and image augmentation (2, 3). I pass a pretrained InceptionV3 model into the make_pretrained_model function and feed the model returned from that function into train_model. Basically, one of these models, with slightly different settings, runs on each GPU.
from keras.layers import Dense, Dropout, GlobalAveragePooling2D
from keras.models import Model
from keras.optimizers import RMSprop

def make_pretrained_model(model, num_classes, frozen_top_layers=0, lr=0.0001, num_size_dense=[]):
    '''
    Building architecture of pretrained model
    '''
    in_inc = model.output
    gap1 = GlobalAveragePooling2D()(in_inc)
    # Optional dense head: one (units, dropout rate) pair per extra layer
    for size, drop in num_size_dense:
        gap1 = Dense(size, activation='relu')(gap1)
        gap1 = Dropout(drop)(gap1)
    softmax = Dense(num_classes, activation='softmax')(gap1)
    model = Model(inputs=model.input, outputs=softmax)
    if frozen_top_layers > 0:
        # Freeze only the last frozen_top_layers layers, train the rest
        for layer in model.layers[:-frozen_top_layers]:
            layer.trainable = True
        for layer in model.layers[-frozen_top_layers:]:
            layer.trainable = False
    else:
        for layer in model.layers:
            layer.trainable = True
    opt = RMSprop(lr=lr, decay=1e-6)
    model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])
    return model
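For context, this is roughly how I feed it (a sketch; the class count and dense head here are placeholder values, and I assume the matching preprocess_input is what goes into preprocessing_fun later):

from keras.applications.inception_v3 import InceptionV3

base = InceptionV3(weights='imagenet', include_top=False, input_shape=(299, 299, 3))
model = make_pretrained_model(base, num_classes=10, frozen_top_layers=0,
                              lr=0.0001, num_size_dense=[(512, 0.5)])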
def augment_image_batch(imgdatagen, x, y, batch_size):
    # Note: this constructs a fresh flow() iterator for every single batch
    return next(imgdatagen.flow(x, y, batch_size=batch_size))
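Since flow() builds a brand-new iterator on every call, a cheaper variant (a sketch, not what I currently run; the names are taken from train_model below) would construct the iterator once and just pull batches from it:

train_iter = train_img_datagen.flow(x_train.astype(float), y_train,
                                    batch_size=train_batch_size)
x_batch, y_batch = next(train_iter)  # same iterator reused for every step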
from time import time
from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import ModelCheckpoint, EarlyStopping, TensorBoard, ReduceLROnPlateau

def train_model(model, x_train, y_train, x_test, y_test, train_batch_size, test_batch_size,
                epochs, model_dir, model_name, num_classes, int_str, preprocessing_fun,
                patience=5, monitor='val_acc'):
    '''
    Training function (note: int_str is currently unused)
    '''
    train_img_datagen = ImageDataGenerator(
        featurewise_center=False,  # also use in test gen if activated AND fit test_gen on train data
        samplewise_center=False,
        featurewise_std_normalization=False,  # also use in test gen if activated AND fit test_gen on train data
        samplewise_std_normalization=False,
        zca_whitening=False,
        zca_epsilon=0,
        rotation_range=0.05,
        width_shift_range=0.05,
        height_shift_range=0.05,
        channel_shift_range=0,
        fill_mode='nearest',
        cval=0,
        vertical_flip=False,
        preprocessing_function=preprocessing_fun,
        shear_range=0.,
        zoom_range=0.,
        horizontal_flip=False)
    val_img_datagen = ImageDataGenerator(
        preprocessing_function=preprocessing_fun)
    callbacks = [ModelCheckpoint(model_dir + model_name + '.h5',  # uses the model_dir argument (was a global MODEL_DIR)
                                 monitor=monitor,
                                 save_best_only=True),
                 EarlyStopping(monitor=monitor, patience=patience),
                 TensorBoard(LOG_DIR + model_name + '_' + str(time())),  # LOG_DIR is a module-level constant
                 ReduceLROnPlateau(monitor='val_loss', factor=0.75, patience=2)]
    def train_feat_gen(x_train, y_train, train_batch_size):
        while True:
            num_full_batches = len(x_train) // train_batch_size
            for batch in range(num_full_batches + 1):
                if batch == num_full_batches:
                    # final, possibly smaller batch with the leftover samples
                    x, y = x_train[batch*train_batch_size:].astype(float), y_train[batch*train_batch_size:]
                else:
                    x, y = x_train[batch*train_batch_size:(1+batch)*train_batch_size].astype(float), y_train[batch*train_batch_size:(1+batch)*train_batch_size]
                yield augment_image_batch(train_img_datagen, x, y, train_batch_size)
    def val_feat_gen(x_val, y_val, test_batch_size):
        while True:
            num_full_batches = len(x_val) // test_batch_size
            for batch in range(num_full_batches + 1):
                if batch == num_full_batches:
                    # final, possibly smaller batch with the leftover samples
                    x, y = x_val[batch*test_batch_size:].astype(float), y_val[batch*test_batch_size:]
                else:
                    x, y = x_val[batch*test_batch_size:(1+batch)*test_batch_size].astype(float), y_val[batch*test_batch_size:(1+batch)*test_batch_size]
                yield augment_image_batch(val_img_datagen, x, y, test_batch_size)
    train_gen_new = train_feat_gen(x_train, y_train, train_batch_size)
    val_gen_new = val_feat_gen(x_test, y_test, test_batch_size)
    model.fit_generator(
        train_gen_new,
        steps_per_epoch=len(x_train) // train_batch_size,
        epochs=epochs,
        validation_data=val_gen_new,
        validation_steps=len(y_test) // test_batch_size,
        callbacks=callbacks,
        shuffle=True)  # shuffle has no effect on plain generators, only on Sequence inputs
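For completeness, a sketch of how each GPU process invokes the two functions (all argument values here are illustrative, not my real settings):

from keras.applications.inception_v3 import InceptionV3, preprocess_input

base = InceptionV3(weights='imagenet', include_top=False, input_shape=(299, 299, 3))
model = make_pretrained_model(base, num_classes=10)
train_model(model, x_train, y_train, x_test, y_test,
            train_batch_size=32, test_batch_size=32, epochs=50,
            model_dir='models/', model_name='inception_gpu0',
            num_classes=10, int_str='', preprocessing_fun=preprocess_input)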