Why does my TensorFlow model stop training?

Asked: 2021-07-27 21:23:55

Tags: python gpu tensor amd

I am training the TensorFlow model "VariationalDeepSemanticHashing" with TensorFlow 2.5 on macOS 11.5.

https://github.com/unsuthee/VariationalDeepSemanticHashing

The model stops training after 5 epochs and 5427 batches, with the Python process size at 46.49 GB. I am running tensorflow-macos and tensorflow-metal. The Mac has 128 GB of DRAM and an AMD Radeon Pro 5700 XT with 14 GB of VRAM.

The TensorFlow Profiler isn't working yet... so I can't see what is going on.

How do I get the TensorFlow Profiler working in TensorFlow 2.5 with 'tensorflow-macos' and 'tensorflow-metal'?
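
For reference, in TF 2.x the profiler is started and stopped via tf.profiler.experimental (tf.profiler.start/stop do not exist in 2.5), and '~' in the log-dir path is not expanded automatically. Below is a minimal sketch of the documented API; whether it can actually capture GPU activity through the Metal PluggableDevice is exactly what I don't know:

import os
import tensorflow as tf

# expand '~' explicitly; TensorFlow treats the path literally
logdir = os.path.expanduser('~/logdir')

tf.profiler.experimental.start(logdir)
# ... run a few training steps ...
tf.profiler.experimental.stop()

# then inspect the trace in TensorBoard (the "Profile" tab):
#   tensorboard --logdir ~/logdir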


The training script (adapted from the repository above):

from __future__ import print_function

import os
import tensorflow as tf
import numpy as np
from utils import *
from VDSH import *

# VDSH is written against the TF1 graph/session API, so eager execution is
# disabled and the tf.compat.v1 symbols are used below when running on TF 2.x
tf.compat.v1.disable_eager_execution()

gpu_devices = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(gpu_devices[0], True)
tf.profiler.experimental.start(os.path.expanduser('~/logdir'))

filename = 'dataset/ng20.tfidf.mat'
data = Load_Dataset(filename)

latent_dim = 32
sess = get_session("0", 0.50) # choose GPU "0" and the fraction of its memory to reserve
model = VDSH(sess, latent_dim, data.n_feas)

# create an optimizer with an exponentially decaying learning rate
learning_rate = 0.001
decay_rate = 0.96
decay_steps = 10000
step = tf.Variable(0, trainable=False)

lr = tf.compat.v1.train.exponential_decay(learning_rate,
                                          step,
                                          decay_steps,
                                          decay_rate,
                                          staircase=True, name="lr")

my_optimizer = tf.compat.v1.train.AdamOptimizer(learning_rate=lr) \
    .minimize(model.cost, global_step=step)

init = tf.compat.v1.global_variables_initializer()
model.sess.run(init)
#merged = tf.compat.v1.summary.merge_all()
#model.merged = merged

total_epoch = 20
kl_weight = 0.
kl_inc = 1 / 5000. # anneal the KL weight linearly from 0 to 1 over 5,000 batches

#saver = tf.compat.v1.train.Saver()
#writer = tf.compat.v1.summary.FileWriter(os.path.expanduser('~/logdir/'), graph=model.sess.graph)

for epoch in range(total_epoch):
    epoch_loss = []
    for i in range(len(data.train)):
        # get one bag-of-words document and the indices of its nonzero terms
        doc = data.train[i]
        word_indices = np.where(doc > 0)[0]

        opt, loss = model.sess.run((my_optimizer, model.cost),
                                   feed_dict={model.input_bow: doc.reshape((-1, data.n_feas)),
                                              model.input_bow_idx: word_indices,
                                              model.kl_weight: kl_weight,
                                              model.keep_prob: 0.9})


        kl_weight = min(kl_weight + kl_inc, 1.0)
        epoch_loss.append(loss)


        if i % 50 == 0:
            print("\rEpoch:{}/{} {}/{}: Loss:{:.3f} AvgLoss:{:.3f}"
                  .format(epoch+1, total_epoch, i, len(data.train), loss, np.mean(epoch_loss)), end='')
            #print(tf.config.experimental.get_memory_info('GPU:0'))

            # TensorBoard statistics (disabled)
            #merged = model.sess.run([model.merged])
            #writer.add_summary(merged, step)
            #writer.flush()
            #writer.close()
            #save_path = saver.save(model.sess, os.path.expanduser('~/logdir/model.chkpt'))
            #writer = tf.compat.v1.summary.FileWriter(os.path.expanduser('~/logdir/'), graph=model.sess.graph)


tf.profiler.experimental.stop()

# run the retrieval experiment: encode, binarize with median hashing, evaluate top-K
zTrain = model.transform(data.train)
zTest = model.transform(data.test)
zTrain = np.array(zTrain)
zTest = np.array(zTest)
medHash = MedianHashing()
cbTrain = medHash.fit_transform(zTrain)
cbTest = medHash.transform(zTest)

TopK=100
print('Retrieve Top{} candidates using hamming distance'.format(TopK))
results = run_topK_retrieval_experiment(cbTrain, cbTest, data.gnd_train, data.gnd_test, TopK)
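
To watch device memory alongside host memory, TF 2.5 exposes tf.config.experimental.get_memory_info (the call I have commented out in the loop above), which returns a dict with 'current' and 'peak' bytes. A minimal sketch; whether the Metal plugin implements it is another open question here:

import tensorflow as tf

# returns {'current': ..., 'peak': ...} in bytes for the given device
info = tf.config.experimental.get_memory_info('GPU:0')
print('GPU memory: {:.1f} MiB current, {:.1f} MiB peak'
      .format(info['current'] / 2**20, info['peak'] / 2**20))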

It seems to be a process memory limit, e.g. ru_maxrss=47493120:

import resource
res_usage = resource.getrusage(resource.RUSAGE_SELF)  # per-process resource usage (not limits)
print(res_usage)

resource.struct_rusage(ru_utime=0.340784, ru_stime=0.11195999999999999, ru_maxrss=47493120, ru_ixrss=0, ru_idrss=0, ru_isrss=0, ru_minflt=20600, ru_majflt=0, ru_nswap=0, ru_inblock=0, ru_oublock=0, ru_msgsnd=168, ru_msgrcv=131, ru_nsignals=0, ru_nvcsw=313, ru_nivcsw=1024)
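
Note that getrusage() reports resource usage rather than limits; the actual per-process limits come from getrlimit(). A quick check (resource.RLIM_INFINITY, i.e. -1, means no limit):

import resource

# soft/hard limit on the process data segment; RLIM_INFINITY means unlimited
soft, hard = resource.getrlimit(resource.RLIMIT_DATA)
print('RLIMIT_DATA soft/hard:', soft, hard)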

The 47 GB ru_maxrss appears to be under JupyterLab. In a Python 3.8.2 shell, it is 75 GB.

 % python
Python 3.8.5 (default, Sep  4 2020, 02:22:02) 
[Clang 10.0.0 ] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import resource
>>> res_limits = resource.getrusage(resource.RUSAGE_SELF)
>>> print(res_limits)
resource.struct_rusage(ru_utime=0.018726, ru_stime=0.021442, ru_maxrss=7528448, ru_ixrss=0, ru_idrss=0, ru_isrss=0, ru_minflt=1430, ru_majflt=601, ru_nswap=0, ru_inblock=0, ru_oublock=0, ru_msgsnd=0, ru_msgrcv=0, ru_nsignals=0, ru_nvcsw=466, ru_nivcsw=103)
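
One caveat when comparing these numbers: the units of ru_maxrss are platform-dependent (bytes on macOS, kilobytes on Linux). A minimal sketch for logging peak RSS from inside the batch loop, assuming the macOS convention:

import resource
import sys

def peak_rss_gb():
    # ru_maxrss is reported in bytes on macOS (darwin), kilobytes on Linux
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return rss / 2**30 if sys.platform == 'darwin' else rss / 2**20

# e.g. inside the training loop:
#     if i % 500 == 0:
#         print('  peak RSS: {:.2f} GB'.format(peak_rss_gb()))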

0 Answers