Question

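Note: although the example code is Python, the problem also occurs once the graph has been deserialized, so it appears to be language independent.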

Question

Is there anything that can be tried or changed to reduce the memory usage of tensorflow.python.ops.lookup_ops.index_table_from_file?

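Origin of the problem

This problem came up while trying to serve a TensorFlow model that contains a tensorflow.python.ops.lookup_ops.index_table_from_file, used to check whether a word occurs in a vocabulary. When making predictions, the model's memory usage grows to roughly 8x the size of the vocabulary on disk (a 350MB vocabulary on disk versus more than 3GB of total memory at runtime).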
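To put numbers like the ones above on a single process, one option is to sample the peak resident set size before and after the expensive step. The following is only a sketch under assumptions: it uses the Unix-only standard library module `resource`, the helper name `peak_rss_mb` is made up for illustration, and the list comprehension is a stand-in for the real table initialization.

```python
import resource
import sys


def peak_rss_mb():
    """Return the peak resident set size of this process in MiB (hypothetical helper)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


before = peak_rss_mb()
# Stand-in workload; in the real case this would be sess.run(tf.tables_initializer()).
payload = [str(i) for i in range(200_000)]
after = peak_rss_mb()
print(f"peak RSS grew by {after - before:.1f} MiB")
```

Because this reads a peak value, it can only show growth; it will not show memory being released afterwards.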

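Details

The cause appears to be the size of the hash table created when index_table_from_file is initialized. The same memory increase is seen when the table lookup is used on its own, outside of any model. Stepping through with the debugger shows that the memory grows at initialization.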

import tensorflow as tf
from tensorflow.python.ops.lookup_ops import TextFileIndex, index_table_from_file

# lookup_table_filename is the path to the vocabulary file (one key per line).
lookup_table = index_table_from_file(
    vocabulary_file=lookup_table_filename,
    default_value=-1,
    num_oov_buckets=0,
    vocab_size=None,
    name='vocab_table',
    key_dtype=tf.string,
    key_column_index=TextFileIndex.WHOLE_LINE,
    value_column_index=TextFileIndex.LINE_NUMBER
)
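As a rough intuition for why an in-memory hash table can be several times larger than its source file, the effect can be reproduced with a plain Python dict. This is an analogy only: it says nothing about how TensorFlow's table is implemented internally, and the function name `dict_overhead_ratio` and the key count are made up for illustration.

```python
import tracemalloc


def dict_overhead_ratio(num_keys=100_000):
    """Compare the traced size of a str -> int dict with the size of the same keys on disk."""
    # One key per line, ASCII digits only: len(key) bytes plus one newline.
    bytes_on_disk = sum(len(str(i)) + 1 for i in range(num_keys))

    tracemalloc.start()
    table = {str(i): i for i in range(num_keys)}
    bytes_in_memory, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    assert len(table) == num_keys  # keep the table alive until after measuring
    return bytes_in_memory / bytes_on_disk


print(f"in-memory size is {dict_overhead_ratio():.1f}x the on-disk size")
```

Each entry carries per-object and per-bucket overhead on top of the key bytes, so short keys inflate the most.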

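Things I have tried

I tried changing the number of parallel threads via ConfigProto to see whether that reduced the memory usage, but there was no improvement.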

Full reproduction: this generates and saves an 84MB file.

Running the following opens a TensorFlow debug session.

import tensorflow as tf
import numpy as np
import string
import random
from tensorflow.python.ops.lookup_ops import index_table_from_file
from tensorflow.python import debug as tf_debug

lookup_table_filename = "./lookup_table.csv"
data_filename = './vocab_feature.npz'

def generate_data():
    """creates the required datafiles"""
    letters = [i for i in string.ascii_letters]
    vocab = set([
        i + i + j + j + k + k + l + l
        for i in letters
        for j in letters
        for k in letters
        for l in letters
    ])

    # random.sample needs a sequence; sampling directly from a set fails on Python 3.11+.
    positive_vocab = set(random.sample(list(vocab), 1000000))
    with open(lookup_table_filename, 'w') as f:
        f.write('\n'.join(positive_vocab))
        f.write('\n')
        f.write('\n'.join(map(str, range(10**7))))
        f.write('\n')

    vocab_feature = np.random.choice(list(vocab), size=1000, replace=True)
    np.savez(file=data_filename, vocab_feature=vocab_feature)

# Run the first time
generate_data()

vocab_feature = np.load(data_filename)['vocab_feature']
lookup_table = index_table_from_file(
    vocabulary_file=lookup_table_filename,
    default_value=-1,
    num_oov_buckets=0,
    vocab_size=None,
    name='vocab_table',
    key_dtype=tf.string  
)
text = tf.placeholder(dtype=tf.string, shape=[None, ])
index = lookup_table.lookup(text)
sess = tf.Session()
sess = tf_debug.LocalCLIDebugWrapperSession(sess)
sess.run(tf.tables_initializer())
# sess.run(tf.global_variables_initializer())
np_index = sess.run(index, feed_dict={text: vocab_feature})
sess.close()
print(np_index)
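As a sanity check on the 84MB figure quoted for the reproduction, the size of lookup_table.csv can be worked out from what generate_data() writes: 1,000,000 eight-character words and the integers 0 to 10**7 - 1, each followed by a newline. The sketch below assumes one byte per character (all ASCII); the function name `expected_vocab_file_bytes` is made up for illustration.

```python
def expected_vocab_file_bytes(num_words=1_000_000, word_len=8, num_ints=10**7):
    """Estimate the byte size of the vocabulary file written by generate_data()."""
    # Each word contributes its characters plus one newline.
    word_bytes = num_words * (word_len + 1)

    # Group the integers 0..num_ints-1 by digit count; each line is digits + newline.
    int_bytes = 0
    digits = 1
    low = 0
    while low < num_ints:
        high = min(10**digits, num_ints)
        int_bytes += (high - low) * (digits + 1)
        low = high
        digits += 1

    return word_bytes + int_bytes


size = expected_vocab_file_bytes()
print(f"{size} bytes = {size / 2**20:.1f} MiB")
```

With the default arguments this gives 87,888,890 bytes, about 83.8 MiB, which is consistent with the stated file size.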

Cross-posted to https://github.com/tensorflow/tensorflow/issues/24936

High memory usage of lookup_ops compared to the size on disk

0 answers: