TensorFlow data parallelism | only one of multiple GPUs has non-zero utilization

Time: 2019-06-27 04:48:43

Tags: tensorflow gpu

I have finished a multi-GPU version of word2vec, and I enabled log_device_placement in the code. The log shows that the ops have been placed on multiple GPUs:


2019-06-27 00:32:34.536178: I tensorflow/core/common_runtime/placer.cc:874] optimizer_7/gradients/loss_7/sampled_losses/Log1p_grad/add/x: (Const)/job:localhost/replica:0/task:0/device:GPU:7
optimizer_6/gradients/loss_6/sampled_losses/Log1p_grad/add/x: (Const): /job:localhost/replica:0/task:0/device:GPU:6
2019-06-27 00:32:34.536188: I tensorflow/core/common_runtime/placer.cc:874] optimizer_6/gradients/loss_6/sampled_losses/Log1p_grad/add/x: (Const)/job:localhost/replica:0/task:0/device:GPU:6
optimizer_5/gradients/loss_5/sampled_losses/Log1p_grad/add/x: (Const): /job:localhost/replica:0/task:0/device:GPU:5
2019-06-27 00:32:34.536202: I tensorflow/core/common_runtime/placer.cc:874] optimizer_5/gradients/loss_5/sampled_losses/Log1p_grad/add/x: (Const)/job:localhost/replica:0/task:0/device:GPU:5
optimizer_4/gradients/loss_4/sampled_losses/Log1p_grad/add/x: (Const): /job:localhost/replica:0/task:0/device:GPU:4
2019-06-27 00:32:34.536216: I tensorflow/core/common_runtime/placer.cc:874] optimizer_4/gradients/loss_4/sampled_losses/Log1p_grad/add/x: (Const)/job:localhost/replica:0/task:0/device:GPU:4
optimizer_3/gradients/loss_3/sampled_losses/Log1p_grad/add/x: (Const): /job:localhost/replica:0/task:0/device:GPU:3
2019-06-27 00:32:34.536231: I tensorflow/core/common_runtime/placer.cc:874] optimizer_3/gradients/loss_3/sampled_losses/Log1p_grad/add/x: (Const)/job:localhost/replica:0/task:0/device:GPU:3
optimizer_2/gradients/loss_2/sampled_losses/Log1p_grad/add/x: (Const): /job:localhost/replica:0/task:0/device:GPU:2
2019-06-27 00:32:34.536246: I tensorflow/core/common_runtime/placer.cc:874] optimizer_2/gradients/loss_2/sampled_losses/Log1p_grad/add/x: (Const)/job:localhost/replica:0/task:0/device:GPU:2
optimizer_1/gradients/loss_1/sampled_losses/Log1p_grad/add/x: (Const): /job:localhost/replica:0/task:0/device:GPU:1
2019-06-27 00:32:34.536273: I tensorflow/core/common_runtime/placer.cc:874] optimizer_1/gradients/loss_1/sampled_losses/Log1p_grad/add/x: (Const)/job:localhost/replica:0/task:0/device:GPU:1
optimizer/gradients/loss/sampled_losses/Log1p_grad/add/x: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2019-06-27 00:32:34.536288: I tensorflow/core/common_runtime/placer.cc:874] optimizer/gradients/loss/sampled_losses/Log1p_grad/add/x: (Const)/job:localhost/replica:0/task:0/device:GPU:0
...

But nvidia-smi showed only one GPU actually doing work at the time:


+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.44                 Driver Version: 396.44                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:04:00.0 Off |                  N/A |
| 36%   49C    P2    80W / 250W | 10882MiB / 11178MiB  |     26%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:06:00.0 Off |                  N/A |
| 29%   39C    P2    56W / 250W | 10631MiB / 11178MiB  |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:07:00.0 Off |                  N/A |
| 29%   36C    P2    54W / 250W | 10631MiB / 11178MiB  |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:08:00.0 Off |                  N/A |
| 29%   38C    P2    55W / 250W | 10631MiB / 11178MiB  |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce GTX 108...  Off  | 00000000:0C:00.0 Off |                  N/A |
| 29%   38C    P2    55W / 250W | 10631MiB / 11178MiB  |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce GTX 108...  Off  | 00000000:0D:00.0 Off |                  N/A |
| 29%   33C    P2    55W / 250W | 10631MiB / 11178MiB  |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  GeForce GTX 108...  Off  | 00000000:0E:00.0 Off |                  N/A |
| 29%   37C    P2    55W / 250W | 10631MiB / 11178MiB  |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  GeForce GTX 108...  Off  | 00000000:0F:00.0 Off |                  N/A |
| 29%   36C    P2    54W / 250W | 10663MiB / 11178MiB  |      6%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     38130      C   python                                      8987MiB |
|    1     38130      C   python                                     10621MiB |
|    2     38130      C   python                                     10621MiB |
|    3     38130      C   python                                     10621MiB |
|    4     38130      C   python                                     10621MiB |
|    5     38130      C   python                                     10621MiB |
|    6     38130      C   python                                     10621MiB |
|    7     38130      C   python                                     10653MiB |
+-----------------------------------------------------------------------------+
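
Note that the near-full Memory-Usage on all eight GPUs does not by itself mean all eight are computing: by default, TensorFlow 1.x reserves almost all memory on every visible GPU whether or not kernels ever run there, so the GPU-Util column (26% on GPU 0, 0% on GPUs 1 to 6) is the meaningful signal here. As a minimal sketch, based on the allow_growth option that is commented out in the code below, memory usage can be made to reflect what each GPU actually touches:

config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
# Allocate GPU memory on demand instead of reserving it all up front, so
# nvidia-smi's Memory-Usage column tracks real per-GPU usage:
config.gpu_options.allow_growth = True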

My source code is attached here:

...
    with tf.name_scope('inputs'):
        train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
        train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])

    upper = 4
    for i in range(0,upper):
        with tf.device(tf.DeviceSpec(device_type="GPU", device_index=i)):
            data_size = batch_size / upper
            data_size = int(data_size)
            print(data_size)
            _train_inputs = train_inputs[i * data_size : (i + 1) * data_size]
            _train_labels = train_labels[i * data_size : (i + 1) * data_size]
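            # Each GPU tower gets an equal, contiguous slice of the feed batch:
            # batch_size/upper examples per tower (for example, 32 per tower
            # if batch_size were 128; an illustrative value, not from the post).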


            with tf.name_scope('embeddings'):
                if prev_emb_model == '0': 
                    embeddings = tf.Variable(
                        tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
                else:
                    add_on_emb = tf.random_uniform([vocabulary_size - len(emb), embedding_size], -1.0, 1.0)
                    embeddings = tf.concat([emb, add_on_emb], 0)
                embed = tf.nn.embedding_lookup(embeddings, _train_inputs)

            # Construct the variables for the NCE loss
            with tf.name_scope('weights'):
                nce_weights = tf.Variable(
                    tf.truncated_normal([vocabulary_size, embedding_size],
                                        stddev=1.0 / math.sqrt(embedding_size)))
            with tf.name_scope('biases'):
                nce_biases = tf.Variable(tf.zeros([vocabulary_size]))
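            # NOTE: embeddings, nce_weights and nce_biases above are created
            # afresh on every pass through the loop, so each GPU tower holds
            # its own independent copy of the parameters; nothing is shared
            # across towers.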

            # Compute the average NCE loss for the batch.
            # tf.nn.nce_loss automatically draws a new sample of the negative
            # labels each time we evaluate the loss.
            # Explanation of the meaning of NCE loss:
            #   http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

            # with tf.device(tf.DeviceSpec(device_type="GPU", device_index=0)):
            with tf.name_scope('loss'):
                loss = tf.reduce_mean(
                    tf.nn.nce_loss(
                        weights=nce_weights,
                        biases=nce_biases,
                        labels=_train_labels,
                        inputs=embed,
                        num_sampled=num_sampled,
                        num_classes=vocabulary_size))


            # Construct the SGD optimizer using a learning rate of 1.0.
            with tf.name_scope('optimizer'):
                optimizer = tf.train.GradientDescentOptimizer(
                    1.0).minimize(loss, colocate_gradients_with_ops=True)
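
            # NOTE: loss and optimizer are plain Python names that are re-bound
            # on every iteration, so after this loop they refer only to the
            # last tower's ops.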


    # Compute the cosine similarity between minibatch examples and all
    # embeddings.
    norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keepdims=True))
    normalized_embeddings = embeddings / norm

    # Add variable initializer.
    init = tf.global_variables_initializer()


config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
# config.gpu_options.allow_growth = True
with tf.Session(graph=graph, config=config) as session:

    #  We must initialize all variables before we use them.
    init.run()
    print('Initialized')
    average_loss = 0

    walks_data = []
    for w in walks:
        for n in w: 
            walks_data.append(n)

    for step in range(args.iter):
        print(step)

        batch_inputs, batch_labels = generate_batch(batch_size, 1,
                                                    window_size, walks_data)

        feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}


        _, loss_val = session.run([optimizer, loss],
                                  feed_dict=feed_dict,
                                  run_metadata=run_metadata)
        average_loss += loss_val

        if step % 2000 == 0:
        ...

    final_embeddings = normalized_embeddings.eval()
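
For reference, session.run([optimizer, loss]) above only fetches whatever ops the Python names optimizer and loss point to after the placement loop, i.e. the last tower plus its dependencies. Below is a minimal sketch of fetching every tower in a single step, assuming the build loop were changed to collect the per-tower ops into lists (tower_optimizers and tower_losses are hypothetical names, not in the original code):

    # Inside the graph-building loop, collect each tower's ops:
    tower_optimizers, tower_losses = [], []
    for i in range(0, upper):
        with tf.device(tf.DeviceSpec(device_type="GPU", device_index=i)):
            ...  # build _train_inputs, loss and optimizer exactly as above
            tower_optimizers.append(optimizer)
            tower_losses.append(loss)

    # In the training loop, fetch all towers in one step; Session.run accepts
    # nested fetch structures and returns results in the same shape (fetching
    # an Operation yields None, so the first element is a list of Nones):
    _, loss_vals = session.run([tower_optimizers, tower_losses],
                               feed_dict=feed_dict)
    average_loss += sum(loss_vals) / len(loss_vals)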

0 Answers:

No answers yet.