mLSTM transform - running it on multiple GPUs

Time: 2017-12-03 14:34:49

Tags: tensorflow deep-learning gpu lstm multi-gpu

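I am running an mLSTM (multiplicative LSTM) transform, based on the mLSTM by OpenAI (only the transform; the model is already trained), but transforming more than 100,000 documents takes a long time.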

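I would like it to run on multiple GPUs. I have seen some examples, but I do not know how to apply them to this mLSTM transform code.

The specific part I want to run on multiple GPUs is: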

        def transform(xs):
            tstart = time.time()
            xs = [preprocess(x) for x in xs]
            lens = np.asarray([len(x) for x in xs])
            sorted_idxs = np.argsort(lens)
            unsort_idxs = np.argsort(sorted_idxs)
            sorted_xs = [xs[i] for i in sorted_idxs]
            maxlen = np.max(lens)
            offset = 0
            n = len(xs)
            smb = np.zeros((2, n, hps.nhidden), dtype=np.float32)
            for step in range(0, ceil_round_step(maxlen, nsteps), nsteps):
                start = step
                end = step+nsteps
                xsubseq = [x[start:end] for x in sorted_xs]
                ndone = sum([x == b'' for x in xsubseq])
                offset += ndone
                xsubseq = xsubseq[ndone:]
                sorted_xs = sorted_xs[ndone:]
                nsubseq = len(xsubseq)
                xmb, mmb = batch_pad(xsubseq, nsubseq, nsteps)
                for batch in range(0, nsubseq, nbatch):
                    start = batch
                    end = batch+nbatch
                    batch_smb = seq_rep(
                        xmb[start:end], mmb[start:end],
                        smb[:, offset+start:offset+end, :])
                    smb[:, offset+start:offset+end, :] = batch_smb
            features = smb[0, unsort_idxs, :]
            print('%0.3f seconds to transform %d examples' %
                  (time.time() - tstart, n))
            return features

This is only a small part of the full code (I do not think I can copy the whole code here).

1 answer:

Answer 0: (score: 1)

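The part you are referring to is not where the computation is split across GPUs; it only transforms the data (on the CPU!) and runs the session.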

The right place is where the computational graph is defined, e.g. the mlstm method. There are many ways to partition the graph, e.g. placing the LSTM cells on different GPUs so that the input sequence can be processed in parallel:

        def mlstm(inputs, c, h, M, ndim, scope='lstm', wn=False):
            [...]
            for idx, x in enumerate(inputs):
                # GPU_COUNT: number of available GPUs (not defined in this snippet)
                with tf.device('/gpu:' + str(idx % GPU_COUNT)):
                    m = tf.matmul(x, wmx) * tf.matmul(h, wmh)
                    z = tf.matmul(x, wx) + tf.matmul(m, wh) + b
                    [...]

By the way, tensorflow has a useful config option, log_device_placement, that helps to see the execution details in the output. Here is an example:

        import tensorflow as tf

        # Creates a graph.
        with tf.device('/gpu:0'):
            a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], name='a')
            b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], name='b')
            c = tf.add(a, b)

        # Creates a session with log_device_placement set to True.
        with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
            # Prints the following:
            # Device mapping:
            # /job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: <GPU name>, pci bus id: 0000:01:00.0, compute capability: 6.1
            # Add: (Add): /job:localhost/replica:0/task:0/device:GPU:0
            # b: (Const): /job:localhost/replica:0/task:0/device:GPU:0
            # a: (Const): /job:localhost/replica:0/task:0/device:GPU:0
            print(sess.run(c))
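
The round-robin placement in the mlstm snippet (the `'/gpu:' + str(... % GPU_COUNT)` device string) can be checked on its own, without TensorFlow. What follows is only a minimal sketch in plain Python: `device_for_step` and `assign_devices` are hypothetical helper names that do not appear in the original code, and `gpu_count` stands in for the `GPU_COUNT` constant, which the answer leaves undefined.

```python
def device_for_step(idx, gpu_count):
    """Return a TensorFlow-style device string for time step `idx`.

    Hypothetical helper: mirrors '/gpu:' + str(idx % GPU_COUNT).
    """
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    return '/gpu:' + str(idx % gpu_count)


def assign_devices(nsteps, gpu_count):
    """Map every time step 0..nsteps-1 to a device, round-robin."""
    return [device_for_step(idx, gpu_count) for idx in range(nsteps)]


if __name__ == '__main__':
    # With 2 GPUs, consecutive steps alternate between the two devices.
    for idx, device in enumerate(assign_devices(5, 2)):
        print(idx, device)
```

Inside the graph-building loop, the returned string would be what gets passed to `tf.device(...)`. Whether the ops actually end up on the intended GPUs is best confirmed with log_device_placement rather than assumed.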