I have a cluster with 8 GPUs. When I run the piece of TensorFlow code pasted below, it only uses a single GPU instead of all 8. I confirmed this with nvidia-smi.
# Imports required by the code below (TF 1.x / Keras 2.x style APIs are used)
import os
import sys
import random
import warnings

import numpy as np
import cv2
import matplotlib.pyplot as plt
import tensorflow as tf
from tqdm import tqdm
from skimage.io import imread, imshow
from skimage.transform import resize

from keras.models import Model
from keras.layers import Input, Lambda, Conv2D, Conv2DTranspose, MaxPooling2D, concatenate
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras import optimizers
from keras import backend as K

# Set some parameters
IMG_WIDTH = 256
IMG_HEIGHT = 256
IMG_CHANNELS = 3
TRAIN_IM = './train_im/'
TRAIN_MASK = './train_mask/'
TEST_PATH = './test/'
warnings.filterwarnings('ignore', category=UserWarning, module='skimage')
num_training = len(os.listdir(TRAIN_IM))
num_test = len(os.listdir(TEST_PATH))
# Get and resize train images
X_train = np.zeros((num_training, IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS), dtype=np.uint8)
Y_train = np.zeros((num_training, IMG_HEIGHT, IMG_WIDTH, 1), dtype=bool)
print('Getting and resizing train images and masks ... ')
sys.stdout.flush()
#load training images
for count, filename in tqdm(enumerate(os.listdir(TRAIN_IM)), total=num_training):
    img = imread(os.path.join(TRAIN_IM, filename))[:,:,:IMG_CHANNELS]
    img = resize(img, (IMG_HEIGHT, IMG_WIDTH), mode='constant', preserve_range=True)
    X_train[count] = img
    name, ext = os.path.splitext(filename)
    mask_name = name + '_mask' + ext
    mask = cv2.imread(os.path.join(TRAIN_MASK, mask_name))[:,:,:1]
    mask = resize(mask, (IMG_HEIGHT, IMG_WIDTH))
    Y_train[count] = mask
# Check if training data looks all right
ix = random.randint(0, num_training-1)
print(ix)
imshow(X_train[ix])
plt.show()
imshow(np.squeeze(Y_train[ix]))
plt.show()
# Define IoU metric
def mean_iou(y_true, y_pred):
    prec = []
    for t in np.arange(0.5, 1.0, 0.05):
        y_pred_ = tf.to_int32(y_pred > t)
        score, up_opt = tf.metrics.mean_iou(y_true, y_pred_, 2)
        K.get_session().run(tf.local_variables_initializer())
        with tf.control_dependencies([up_opt]):
            score = tf.identity(score)
        prec.append(score)
    return K.mean(K.stack(prec), axis=0)
# Build U-Net model
inputs = Input((IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS))
s = Lambda(lambda x: x / 255) (inputs)
width = 64
c1 = Conv2D(width, (3, 3), activation='relu', padding='same') (s)
c1 = Conv2D(width, (3, 3), activation='relu', padding='same') (c1)
p1 = MaxPooling2D((2, 2)) (c1)
c2 = Conv2D(width*2, (3, 3), activation='relu', padding='same') (p1)
c2 = Conv2D(width*2, (3, 3), activation='relu', padding='same') (c2)
p2 = MaxPooling2D((2, 2)) (c2)
c3 = Conv2D(width*4, (3, 3), activation='relu', padding='same') (p2)
c3 = Conv2D(width*4, (3, 3), activation='relu', padding='same') (c3)
p3 = MaxPooling2D((2, 2)) (c3)
c4 = Conv2D(width*8, (3, 3), activation='relu', padding='same') (p3)
c4 = Conv2D(width*8, (3, 3), activation='relu', padding='same') (c4)
p4 = MaxPooling2D(pool_size=(2, 2)) (c4)
c5 = Conv2D(width*16, (3, 3), activation='relu', padding='same') (p4)
c5 = Conv2D(width*16, (3, 3), activation='relu', padding='same') (c5)
u6 = Conv2DTranspose(width*8, (2, 2), strides=(2, 2), padding='same') (c5)
u6 = concatenate([u6, c4])
c6 = Conv2D(width*8, (3, 3), activation='relu', padding='same') (u6)
c6 = Conv2D(width*8, (3, 3), activation='relu', padding='same') (c6)
u7 = Conv2DTranspose(width*4, (2, 2), strides=(2, 2), padding='same') (c6)
u7 = concatenate([u7, c3])
c7 = Conv2D(width*4, (3, 3), activation='relu', padding='same') (u7)
c7 = Conv2D(width*4, (3, 3), activation='relu', padding='same') (c7)
u8 = Conv2DTranspose(width*2, (2, 2), strides=(2, 2), padding='same') (c7)
u8 = concatenate([u8, c2])
c8 = Conv2D(width*2, (3, 3), activation='relu', padding='same') (u8)
c8 = Conv2D(width*2, (3, 3), activation='relu', padding='same') (c8)
u9 = Conv2DTranspose(width, (2, 2), strides=(2, 2), padding='same') (c8)
u9 = concatenate([u9, c1], axis=3)
c9 = Conv2D(width, (3, 3), activation='relu', padding='same') (u9)
c9 = Conv2D(width, (3, 3), activation='relu', padding='same') (c9)
outputs = Conv2D(1, (1, 1), activation='sigmoid') (c9)
model = Model(inputs=[inputs], outputs=[outputs])
sgd = optimizers.SGD(lr=0.03, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(optimizer=sgd, loss='binary_crossentropy', metrics=[mean_iou])
model.summary()
# Fit model
earlystopper = EarlyStopping(patience=20, verbose=1)
checkpointer = ModelCheckpoint('nuclei_only.h5', verbose=1, save_best_only=True)
results = model.fit(X_train, Y_train, validation_split=0.05, batch_size=32, verbose=1, epochs=100,
                    callbacks=[earlystopper, checkpointer])
I would like to use mxnet or some other approach to run this code on all the available GPUs. However, I don't know how to do it. All the resources I have found only show how to do this on the MNIST dataset. I have my own dataset, which I read in differently, so I'm not sure how to modify the code.
Answer 0 (score: 5)
TL;DR: Use Keras' multi_gpu_model().
If there are multiple GPUs in your system, the GPU with the lowest ID will be selected by default.
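(As a quick sanity check, not part of the original answer: you can list the devices TensorFlow 1.x actually sees, and restrict or reorder them with the CUDA_VISIBLE_DEVICES environment variable.)

from tensorflow.python.client import device_lib

# Print the GPUs visible to this TensorFlow process
gpus = [d.name for d in device_lib.list_local_devices() if d.device_type == 'GPU']
print(gpus)  # e.g. ['/device:GPU:0', '/device:GPU:1', ...]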
If you want to use multiple GPUs, you unfortunately have to specify manually which tensors are placed on which GPU, e.g.
with tf.device('/device:GPU:2'):
More information is in the TensorFlow guide Using Multiple GPUs.
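A small illustration of manual placement (an example sketch, not code from the question): everything created inside a tf.device block is pinned to that device.

import tensorflow as tf

with tf.device('/device:GPU:0'):
    a = tf.random_normal([1024, 1024])
    b = tf.matmul(a, a)    # runs on GPU 0
with tf.device('/device:GPU:2'):
    c = tf.matmul(b, b)    # runs on GPU 2; the value of b is copied between devices

config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
with tf.Session(config=config) as sess:
    print(sess.run(c).shape)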
There are two main approaches to distributing a network across multiple GPUs:

1. You distribute the network layer-wise over the GPUs. This is easier to implement, but does not bring much of a performance benefit, because the GPUs wait for each other to finish their operations.

2. You create a separate copy of the network, called a "tower", on each GPU. To feed the eightfold network, you split the input batch into 8 parts and distribute them across the towers, let each tower forward-propagate, then sum the gradients and do the backward pass. This results in an almost-linear speedup with the number of GPUs. It is, however, much harder to implement, because you also have to deal with complexities around batch normalization, and it is very advisable to make sure you randomize your batches properly. There is a nice tutorial here. You should also look at the Inception V3 code referenced there for ideas on how to structure such a thing, in particular _tower_loss(), _average_gradients() and the part of train() that starts with for i in range(FLAGS.num_gpus):. A minimal sketch of this tower pattern is given right after this list.
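Here is a minimal, hypothetical sketch of the tower pattern in TF 1.x graph mode, loosely modelled on that code. build_tower_loss and average_gradients are stand-ins written for this illustration (the tiny linear "model" inside is a placeholder for your own network), and the global batch size is assumed to be divisible by NUM_GPUS.

import tensorflow as tf

NUM_GPUS = 8

def build_tower_loss(x_batch, y_batch):
    # Toy stand-in for a real model: one shared weight matrix created with
    # tf.get_variable so that every tower reuses the same parameters.
    w = tf.get_variable('w', [256 * 256 * 3, 1])
    b = tf.get_variable('b', [1], initializer=tf.zeros_initializer())
    logits = tf.matmul(tf.reshape(x_batch, [-1, 256 * 256 * 3]), w) + b
    return tf.losses.sigmoid_cross_entropy(y_batch, logits)

def average_gradients(tower_grads):
    # Average the gradient of each variable over all towers.
    averaged = []
    for grads_and_vars in zip(*tower_grads):
        grads = [g for g, _ in grads_and_vars]
        averaged.append((tf.reduce_mean(tf.stack(grads), axis=0), grads_and_vars[0][1]))
    return averaged

x = tf.placeholder(tf.float32, [None, 256, 256, 3])
y = tf.placeholder(tf.float32, [None, 1])
opt = tf.train.MomentumOptimizer(learning_rate=0.03, momentum=0.9)

x_splits = tf.split(x, NUM_GPUS)   # global batch size must be divisible by NUM_GPUS
y_splits = tf.split(y, NUM_GPUS)
tower_grads = []
with tf.variable_scope(tf.get_variable_scope()):
    for i in range(NUM_GPUS):
        with tf.device('/device:GPU:%d' % i):
            loss = build_tower_loss(x_splits[i], y_splits[i])
            tf.get_variable_scope().reuse_variables()   # share weights across towers
            tower_grads.append(opt.compute_gradients(loss))

train_op = opt.apply_gradients(average_gradients(tower_grads))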
If you would like to give Keras a try, it has now significantly simplified multi-GPU training with multi_gpu_model(). It can do all the heavy lifting for you.
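A minimal sketch of what that could look like for the model in the question (Keras 2.x with the TensorFlow backend; in later TensorFlow versions multi_gpu_model was deprecated in favour of tf.distribute strategies): keep building model exactly as before, wrap it, then compile and fit the wrapped copy.

from keras.utils import multi_gpu_model
from keras import optimizers

parallel_model = multi_gpu_model(model, gpus=8)   # replicate the single-GPU model on 8 GPUs
parallel_model.compile(optimizer=optimizers.SGD(lr=0.03, decay=1e-6, momentum=0.9, nesterov=True),
                       loss='binary_crossentropy', metrics=[mean_iou])

# Each replica sees batch_size / 8 samples, so the global batch is usually scaled up.
results = parallel_model.fit(X_train, Y_train, validation_split=0.05,
                             batch_size=32 * 8, epochs=100, verbose=1,
                             callbacks=[earlystopper, checkpointer])

One caveat: callbacks such as ModelCheckpoint attached to the parallel model save the multi-GPU wrapper, so it is usually easier to keep a reference to the original model and save its weights from there.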