I am trying to run this code: https://github.com/leehomyc/cyclegan-1
Everything works fine until my GPU runs out of memory. My intuition says GPU resources should be released at the end of every epoch of the algorithm, so why does that not seem to happen? It looks as if the process keeps accumulating GPU memory until it can no longer allocate any, and then it throws the error below. I have tried limiting GPU usage, but that did not seem to work either. The image set I am working with is about 100 MB, and my graphics card has 4 GB. Can anyone point out my mistake? If you need more information, let me know and I will provide it. Finally, could using `tf.train.Coordinator`,
`tf.train.start_queue_runners(coord=coord)`,
or `tf.summary.FileWriter(self._output_dir)`
cause this error? Thanks.
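For reference, when I say I tried limiting GPU usage, I mean a session-configuration fragment along these lines (a TF 1.x sketch; the `0.5` fraction is just an example value, not something the repo ships with):

```python
# Sketch (TF 1.x) of the GPU-limiting options I tried.
# NOTE: the 0.5 fraction below is an example value, not from the repo.
import tensorflow as tf

config = tf.ConfigProto()
# Option 1: let the allocator grow on demand instead of grabbing all memory.
config.gpu_options.allow_growth = True
# Option 2: hard-cap this process at a fraction of total GPU memory.
config.gpu_options.per_process_gpu_memory_fraction = 0.5

with tf.Session(config=config) as sess:
    pass  # training loop goes here
```

Neither option changed the behavior for me: memory still climbed until the allocation failure below.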
ERROR
Model/g_B/c6/Conv/weights:0
Model/g_B/c6/Conv/biases:0
2017-11-29 11:20:40.825993: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-29 11:20:40.826013: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-29 11:20:40.826017: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-11-29 11:20:40.826020: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-29 11:20:40.826040: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-11-29 11:20:40.955576: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-11-29 11:20:40.956141: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties:
name: GeForce GTX 1050 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.62
pciBusID 0000:01:00.0
Total memory: 3.94GiB
Free memory: 3.56GiB
2017-11-29 11:20:40.956177: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0
2017-11-29 11:20:40.956188: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0: Y
2017-11-29 11:20:40.956207: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1050 Ti, pci bus id: 0000:01:00.0)
hereheeeheheheeheh
2017-11-29 11:20:40.958641: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1050 Ti, pci bus id: 0000:01:00.0)
('In the epoch ', 0)
Saving image 0/20
Saving image 1/20
**ETC**
Saving image 19/20
Processing batch 0/200
Garbage collector: collected 0 objects.
('lets see: ', None)
Processing batch 1/200
Garbage collector: collected 0 objects.
('lets see: ', None)
Processing batch 2/200
Garbage collector: collected 0 objects.
**ETC**
Processing batch 194/200
Garbage collector: collected 0 objects.
('lets see: ', None)
Processing batch 195/200
Garbage collector: collected 0 objects.
('lets see: ', None)
Processing batch 196/200
Garbage collector: collected 0 objects.
('lets see: ', None)
Processing batch 197/200
Garbage collector: collected 0 objects.
('lets see: ', None)
Processing batch 198/200
Garbage collector: collected 0 objects.
('lets see: ', None)
Processing batch 199/200
Garbage collector: collected 0 objects.
('lets see: ', None)
('In the epoch ', 1)
Saving image 0/20
Saving image 1/20
**ETC**
Saving image 19/20
Processing batch 0/200
Garbage collector: collected 0 objects.
('lets see: ', None)
Processing batch 1/200
Garbage collector: collected 0 objects.
('lets see: ', None)
**ETC**
Processing batch 9/200
Garbage collector: collected 0 objects.
('lets see: ', None)
Processing batch 10/200
2017-11-29 11:25:06.741162: E tensorflow/stream_executor/cuda/cuda_driver.cc:955] failed to alloc 4294967296 bytes on host: CUDA_ERROR_OUT_OF_MEMORY
2017-11-29 11:25:06.742407: W ./tensorflow/core/common_runtime/gpu/pool_allocator.h:195] could not allocate pinned host memory of size: 4294967296
2017-11-29 11:25:06.742620: E tensorflow/stream_executor/cuda/cuda_driver.cc:955] failed to alloc 3865470464 bytes on host: CUDA_ERROR_OUT_OF_MEMORY
2017-11-29 11:25:06.742630: W ./tensorflow/core/common_runtime/gpu/pool_allocator.h:195] could not allocate pinned host memory of size: 3865470464
2017-11-29 11:25:06.742826: E tensorflow/stream_executor/cuda/cuda_driver.cc:955] failed to alloc 3478923264 bytes on host: CUDA_ERROR_OUT_OF_MEMORY
2017-11-29 11:25:06.742835: W ./tensorflow/core/common_runtime/gpu/pool_allocator.h:195] could not allocate pinned host memory of size: 3478923264
Killed
Training method code
def train(self):
    """Training Function."""
    # Load the dataset from the dataset folder
    self.inputs = data_loader.load_data(
        self._dataset_name, self._size_before_crop,
        True, self._do_flipping)

    # Build the network
    self.model_setup()

    # Loss function calculations
    self.compute_losses()

    # Initialize the global variables
    init = (tf.global_variables_initializer(),
            tf.local_variables_initializer())
    saver = tf.train.Saver()

    max_images = cyclegan_datasets.DATASET_TO_SIZES[self._dataset_name]

    # gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.5)
    print("hereheeeheheheeheh")
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True

    with tf.Session(config=config) as sess:
        sess.run(init)

        # Restore the model from the last checkpoint
        if self._to_restore:
            chkpt_fname = tf.train.latest_checkpoint(self._checkpoint_dir)
            saver.restore(sess, chkpt_fname)

        writer = tf.summary.FileWriter(self._output_dir)

        if not os.path.exists(self._output_dir):
            os.makedirs(self._output_dir)

        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(coord=coord)

        # Training loop
        for epoch in range(sess.run(self.global_step), self._max_step):
            print("In the epoch ", epoch)
            saver.save(sess, os.path.join(
                self._output_dir, "cyclegan"), global_step=epoch)

            # Adjust the learning rate according to the epoch number
            if epoch < 100:
                curr_lr = self._base_lr
            else:
                curr_lr = self._base_lr - \
                    self._base_lr * (epoch - 100) / 100

            self.save_images(sess, epoch)

            for i in range(0, max_images):
                print("Processing batch {}/{}".format(i, max_images))

                inputs = sess.run(self.inputs)

                # Optimizing the G_A network
                _, fake_B_temp, summary_str = sess.run(
                    [self.g_A_trainer,
                     self.fake_images_b,
                     self.g_A_loss_summ],
                    feed_dict={
                        self.input_a: inputs['images_i'],
                        self.input_b: inputs['images_j'],
                        self.learning_rate: curr_lr
                    }
                )
                writer.add_summary(summary_str, epoch * max_images + i)

                fake_B_temp1 = self.fake_image_pool(
                    self.num_fake_inputs, fake_B_temp, self.fake_images_B)

                # Optimizing the D_B network
                _, summary_str = sess.run(
                    [self.d_B_trainer, self.d_B_loss_summ],
                    feed_dict={
                        self.input_a: inputs['images_i'],
                        self.input_b: inputs['images_j'],
                        self.learning_rate: curr_lr,
                        self.fake_pool_B: fake_B_temp1
                    }
                )
                writer.add_summary(summary_str, epoch * max_images + i)

                # Optimizing the G_B network
                _, fake_A_temp, summary_str = sess.run(
                    [self.g_B_trainer,
                     self.fake_images_a,
                     self.g_B_loss_summ],
                    feed_dict={
                        self.input_a: inputs['images_i'],
                        self.input_b: inputs['images_j'],
                        self.learning_rate: curr_lr
                    }
                )
                writer.add_summary(summary_str, epoch * max_images + i)

                fake_A_temp1 = self.fake_image_pool(
                    self.num_fake_inputs, fake_A_temp, self.fake_images_A)

                # Optimizing the D_A network
                _, summary_str = sess.run(
                    [self.d_A_trainer, self.d_A_loss_summ],
                    feed_dict={
                        self.input_a: inputs['images_i'],
                        self.input_b: inputs['images_j'],
                        self.learning_rate: curr_lr,
                        self.fake_pool_A: fake_A_temp1
                    }
                )
                writer.add_summary(summary_str, epoch * max_images + i)

                writer.flush()
                collected = gc.collect()
                print("Garbage collector: collected %d objects." % collected)
                print("lets see: ", writer.flush())

                self.num_fake_inputs += 1

            sess.run(tf.assign(self.global_step, epoch + 1))

        coord.request_stop()
        coord.join(threads)
        writer.add_graph(sess.graph)