OOM — unable to run StyleGAN2 even after reducing the batch size

Asked: 2019-12-16 17:36:59

Tags: tensorflow gpu

I am trying to run StyleGAN2 on a cluster with eight GPUs (NVIDIA GeForce RTX 2080). I am currently using the following configuration in training_loop.py:

minibatch_size_dict     = {4: 512, 8: 256, 16: 128, 32: 64, 64: 32},       # Resolution-specific overrides.
minibatch_gpu_base      = 8,        # Number of samples processed at a time by one GPU.
minibatch_gpu_dict      = {},       # Resolution-specific overrides.
G_lrate_base            = 0.001,    # Learning rate for the generator.
G_lrate_dict            = {},       # Resolution-specific overrides.
D_lrate_base            = 0.001,    # Learning rate for the discriminator.
D_lrate_dict            = {},       # Resolution-specific overrides.
lrate_rampup_kimg       = 0,        # Duration of learning rate ramp-up.
tick_kimg_base          = 4,        # Default interval of progress snapshots.
tick_kimg_dict          = {4:10, 8:10, 16:10, 32:10, 64:10, 128:8, 256:6, 512:4},  # Resolution-specific overrides.

I am training on a set of 512x512-pixel images. After a few iterations I get the error message reported below, and the script appears to stop running (watching with `watch nvidia-smi`, both the GPUs' temperature and fan activity drop). I have already reduced the batch size, but the problem seems to lie elsewhere. Do you have any tips on how to fix this?

I was able to run StyleGAN with the same dataset. In the paper they say StyleGAN2 should be less heavy, so I am a bit surprised.

Here is the error message I receive:

2019-12-16 18:22:54.909009: E tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 334.11M (350338048 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-12-16 18:22:54.909087: W tensorflow/core/common_runtime/bfc_allocator.cc:314] Allocator (GPU_0_bfc) ran out of memory trying to allocate 129.00MiB (rounded to 135268352).  Current allocation summary follows.
2019-12-16 18:22:54.918750: W tensorflow/core/common_runtime/bfc_allocator.cc:319] **_***************************_*****x****x******xx***_******************************_***************
2019-12-16 18:22:54.918808: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at conv_grad_input_ops.cc:903 : Resource exhausted: OOM when allocating tensor with shape[4,128,257,257] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
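The allocator's numbers in the last log line are internally consistent, which suggests a genuine memory shortfall rather than a reporting glitch. A quick sanity check (the shape is taken directly from the log; the 4-byte element size follows from the reported `type float`, i.e. float32):

```python
# Recompute the allocation size for the failing tensor from the log:
# shape [4, 128, 257, 257], dtype float32 (4 bytes per element).
shape = (4, 128, 257, 257)

elements = 1
for dim in shape:
    elements *= dim

num_bytes = elements * 4  # float32

print(num_bytes)                    # 135268352 bytes, as in the log
print(round(num_bytes / 2**20, 2))  # 129.0 MiB, matching "129.00MiB"
```

So the 129 MiB request is exactly what that conv-gradient tensor needs; the GPU is simply out of free VRAM at that point in the backward pass.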

2 Answers:

Answer 0 (score: 2)

StyleGAN2's config-f model is actually larger than StyleGAN1. Try a configuration that consumes less VRAM, such as config-e. You can change the model's configuration by passing a flag on the python command line, as shown here: https://github.com/NVlabs/stylegan2/blob/master/run_training.py#L144
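As a sketch, an invocation selecting config-e might look like the following. This is a CLI fragment, not a tested command: the dataset name `my_dataset` and the data directory are placeholders you would substitute, and flag spellings should be checked against the linked run_training.py:

```shell
# Hypothetical example: train with the lighter config-e instead of the
# default config-f, on 8 GPUs. Paths and dataset name are placeholders.
python run_training.py \
    --num-gpus=8 \
    --data-dir=~/datasets \
    --dataset=my_dataset \
    --config=config-e
```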

In my case, I was able to train StyleGAN2 using config-e on 2 RTX 2080 Ti cards.

Answer 1 (score: 1)

From the StyleGAN2 requirements: one or more high-end NVIDIA GPUs, NVIDIA drivers, CUDA 10.0 toolkit and cuDNN 7.5. To reproduce the results reported in the paper, you need an NVIDIA GPU with at least 16 GB of DRAM.

Your NVIDIA GeForce RTX 2080 card has 11GB, but I assume you mean you have 8 of them? I don't believe tensorflow is set up to pool memory across multiple cards out of the box.
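The last point matters here: the total minibatch is split evenly across GPUs, so adding cards reduces the per-GPU batch but each card must still fit its own slice of the activations. A small illustration under stated assumptions (the resolution-override dict is taken from the question; the fallback value `minibatch_size_base = 32` is an assumption based on the repository defaults, since the question's snippet does not show it):

```python
# Per-GPU minibatch at 512x512, assuming the question's config and a
# fallback minibatch_size_base of 32 for resolutions not in the dict.
num_gpus = 8
minibatch_size_dict = {4: 512, 8: 256, 16: 128, 32: 64, 64: 32}
minibatch_size_base = 32  # assumed default for resolutions >= 128

resolution = 512
total_minibatch = minibatch_size_dict.get(resolution, minibatch_size_base)
per_gpu = total_minibatch // num_gpus

print(per_gpu)  # 4 -- consistent with the leading 4 in shape [4,128,257,257]
```

Note the result matches the leading dimension of the tensor that failed to allocate, so each individual 2080 is running out of its own VRAM; more GPUs alone will not help unless the per-GPU slice shrinks further or a lighter config is used.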