Loading a large numpy matrix as a partitioned variable in a TensorFlow graph

Time: 2017-09-04 21:53:56

Tags: tensorflow

Imagine I have a large set of pretrained embeddings that I can load as a numpy array, e.g. of shape [3000000, 200]. This matrix is larger than 2 GB, so with the following code:

import numpy as np
import tensorflow as tf

data = np.zeros(shape=(3000000, 200))
variable = tf.get_variable(
    "weigths",
    [3000000, 200],
    initializer=tf.constant_initializer(data))

session = tf.Session()
session.run(tf.global_variables_initializer())

I get the error ValueError: Cannot create a tensor proto whose content is larger than 2GB.

I could load it with tf.assign and a placeholder, but for certain reasons I want to use a partitioned version of this embedding weight. The plain assign-plus-placeholder route is off the table, because the assign op is not implemented for partitioned variables: NotImplementedError: assign() has not been implemented for PartitionedVariable.

Is it possible to do something like this?
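For context, the partitioned variable can be declared roughly as follows. This is a minimal sketch only: the question does not show which partitioner was used, so a tf.fixed_size_partitioner with an arbitrary shard count is assumed here for illustration.

import tensorflow as tf

# Sketch (assumed setup, not shown in the question): a variable split into
# 10 shards along the first dimension. No huge constant initializer is used,
# so the 2 GB proto limit is not hit at graph-construction time.
embedding = tf.get_variable(
    "weigths",
    shape=[3000000, 200],
    partitioner=tf.fixed_size_partitioner(num_shards=10),
    initializer=tf.zeros_initializer())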

2 Answers:

Answer 0 (score: 3):

Solution

It's ugly, but it works:

import tensorflow as tf

def init_partitioned(session, var_name, data):
    # Collect the individual slices (parts) of the partitioned variable by name.
    partitioned_var = tf.get_collection(
        tf.GraphKeys.GLOBAL_VARIABLES, scope=var_name + r"/part_\d+:0")
    print("For {} found {} parts".format(var_name, len(partitioned_var)))

    dtype = partitioned_var[0].dtype
    part_shape = partitioned_var[0].get_shape().as_list()
    part_shape[0] = None  # parts may have different numbers of rows

    init = tf.placeholder(dtype, part_shape)
    offset = 0
    for idx, part in enumerate(partitioned_var):
        init_op = tf.assign(part, init)
        numRowsInPart = int(part.get_shape()[0])
        # Feed only the slice of `data` that belongs to this part.
        session.run(init_op, feed_dict={init: data[offset:offset + numRowsInPart]})
        offset += numRowsInPart
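A possible usage sketch (hypothetical, assuming the partitioned variable is created with a tf.fixed_size_partitioner under the name "weigths" as in the question):

import numpy as np
import tensorflow as tf

data = np.zeros(shape=(3000000, 200))

# Assumed setup: the same partitioned variable as in the question.
weights = tf.get_variable(
    "weigths",
    shape=[3000000, 200],
    partitioner=tf.fixed_size_partitioner(num_shards=10))

session = tf.Session()
session.run(tf.global_variables_initializer())

# Overwrite the default-initialized parts with the numpy matrix, part by part.
init_partitioned(session, "weigths", data)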

Answer 1 (score: 0):

Try:

import numpy as np
import tensorflow as tf

data = np.zeros(shape=(3000000, 200))

ph = tf.placeholder(tf.float32, shape=(3000000, 200))
variable = tf.Variable(ph)

session = tf.Session()
session.run(tf.global_variables_initializer(), feed_dict={ph:data})
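Note that feeding the matrix through a placeholder when running the initializer keeps the >2 GB constant out of the graph proto, but it produces a single, unpartitioned variable, so it does not by itself give the partitioned layout the question asks for.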