TensorFlow Dataset.from_generator blocking input?

Date: 2017-12-21 02:58:06

Tags: python multithreading tensorflow

I want to build a project in which requests are put into a Python Queue at arbitrary times, a group of TensorFlow models consume the requests from the queue, and the results are returned immediately.

The models run in different threads with different tf.Graphs, but their structure and weight values are identical.

Each model uses tf.data.Dataset.from_generator to wrap a Python iterator that fetches requests from the queue.

The problem is that with multiple models, a request can get blocked until future requests arrive. From the test results, the Python iterator does receive the request as soon as it is put into the queue, but no result comes out of any model. The request does not appear to be dropped either; it seems to be held up inside the tf.data iterator.
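
The suspected behavior can be illustrated with a plain-Python sketch (no TensorFlow involved, purely an illustration): if each model's input pipeline eagerly pulls a request from the shared queue as soon as one is available, an idle model can grab a request and simply sit on it, so the request is consumed but no result is ever produced.

```python
import queue
import threading

shared = queue.Queue()
buffered = []           # requests a consumer pulled but never turned into results
lock = threading.Lock()

def eager_consumer():
    # Mimics a per-model prefetch thread: it grabs a request as soon as
    # one is available, then simply holds it (the model is never run).
    item = shared.get()
    with lock:
        buffered.append(item)

threads = [threading.Thread(target=eager_consumer, daemon=True) for _ in range(4)]
for t in threads:
    t.start()

shared.put("request-0")   # a single request arrives
for t in threads:
    t.join(timeout=0.5)

with lock:
    print(buffered)       # the request was consumed by one idle consumer...
print(shared.qsize())     # ...the shared queue is empty, yet no result came back
```

With 32 such consumers, which consumer happens to win the race decides when (if ever) the request surfaces as a result.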

Here is my test code:

# -*- coding: utf-8 -*-

import tensorflow as tf
import numpy as np
import sys
import random
import time

from queue import Queue
from concurrent.futures import ThreadPoolExecutor

thread_count=int(sys.argv[1])
request_queue=Queue(128)

def data_iter():
    while True:
        yield request_queue.get()

def task():
    with tf.Graph().as_default():
        ds=tf.data.Dataset.from_generator(data_iter, (tf.int32), output_shapes=([1, 8]))
        sample=ds.make_one_shot_iterator().get_next()
        with tf.Session() as sess:
            coord=tf.train.Coordinator()
            threads=tf.train.start_queue_runners(sess=sess, coord=coord)
            while not coord.should_stop():
                try:
                    result=sess.run(sample)
                    print(result)
                except Exception:  # the None sentinel raises inside TF, ending the loop
                    coord.request_stop()
            coord.join(threads)

executor=ThreadPoolExecutor(thread_count)
try:
    for i in range(thread_count):
        executor.submit(task)

    rand=random.Random()
    for i in range(100):
        request_queue.put(np.full((1, 8), i, 'int32'))
        time.sleep(1e-3)  # let a model fetch the request from request_queue
        t=rand.randint(5,10)
        print('round {}, request_queue size is about {}, sleeping {} secs...'.format(i, request_queue.qsize(), t))
        time.sleep(t)
finally:
    for i in range(thread_count):
        request_queue.put(None)
    executor.shutdown()
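
As an aside, the shutdown in the finally block relies on the None put into request_queue making the TF generator wrapper raise. A plain-Python version of the same sentinel-terminated iterator (illustrative only, not part of the test code above) would stop cleanly instead:

```python
import queue

q = queue.Queue()

def data_iter():
    # Block on the queue; a None sentinel ends iteration cleanly
    # (the test code above instead lets None raise inside TF).
    while True:
        item = q.get()
        if item is None:
            return
        yield item

for x in (1, 2, 3, None):
    q.put(x)

result = list(data_iter())
print(result)  # [1, 2, 3]
```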

Environment: Python 3.5.3, TensorFlow 1.4.0

Test results:

  1. Run with a single model: python tf_ds_test.py 1
  2. The output is:

    round 0, request_queue size is about 1, sleeping 6 secs...
    2017-12-21 10:42:24.924251: I C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
    [[0 0 0 0 0 0 0 0]]
    [[1 1 1 1 1 1 1 1]]
    round 1, request_queue size is about 0, sleeping 6 secs...
    [[2 2 2 2 2 2 2 2]]
    round 2, request_queue size is about 0, sleeping 5 secs...
    [[3 3 3 3 3 3 3 3]]
    round 3, request_queue size is about 0, sleeping 7 secs...
    [[4 4 4 4 4 4 4 4]]
    round 4, request_queue size is about 0, sleeping 6 secs...
    [[5 5 5 5 5 5 5 5]]
    round 5, request_queue size is about 0, sleeping 7 secs...
    ...
    

    Everything runs smoothly.

    1. But when running with 32 models: python tf_ds_test.py 32
    2. The output is:

      2017-12-21 10:45:41.660251: I C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
      round 0, request_queue size is about 1, sleeping 9 secs...
      [[0 0 0 0 0 0 0 0]]
      [[1 1 1 1 1 1 1 1]]
      round 1, request_queue size is about 0, sleeping 5 secs...
      round 2, request_queue size is about 0, sleeping 8 secs...
      round 3, request_queue size is about 0, sleeping 10 secs...
      [[4 4 4 4 4 4 4 4]]
      [[2 2 2 2 2 2 2 2]]
      [[3 3 3 3 3 3 3 3]]
      round 4, request_queue size is about 0, sleeping 8 secs...
      round 5, request_queue size is about 0, sleeping 6 secs...
      round 6, request_queue size is about 0, sleeping 10 secs...
      [[6 6 6 6 6 6 6 6]]
      [[5 5 5 5 5 5 5 5]]
      round 7, request_queue size is about 0, sleeping 9 secs...
      [[7 7 7 7 7 7 7 7]]
      round 8, request_queue size is about 0, sleeping 5 secs...
      round 9, request_queue size is about 0, sleeping 10 secs...
      round 10, request_queue size is about 0, sleeping 6 secs...
      round 11, request_queue size is about 0, sleeping 10 secs...
      [[8 8 8 8 8 8 8 8]]
      round 12, request_queue size is about 0, sleeping 8 secs...
      

      The requests were blocked! The Python iterator consumed each request immediately, but the model produced no result for an arbitrarily long time, not until it received its next request.

      Does anyone have any ideas? How can I make these models return results immediately?

1 answer:

Answer 0 (score 0):

Could you modify the loop that generates the elements to:

for i in range(100):
    request_queue.put(np.full((1, 8), i, 'int32'))
    print('round {}, queue size {}'.format(i, request_queue.qsize()))

and share the output?

I tried to reproduce your issue (using a nightly build of TF), but even with 1000 tasks and a 10000-iteration loop everything still ran smoothly.

Could you try this with a nightly build of TF?
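
Not a confirmed fix for the TF 1.4 behavior, but one hedged workaround sketch: give each model its own request queue and round-robin a dispatcher over them, so each from_generator iterator can only run ahead on its own queue and can never grab another model's request. Pure Python below; per_model_queues and dispatch are illustrative names, not TF API:

```python
import itertools
import queue

NUM_MODELS = 4
# One queue per model; each model's from_generator iterator would read
# only from its own queue, so prefetching cannot steal another model's input.
per_model_queues = [queue.Queue() for _ in range(NUM_MODELS)]

def dispatch(requests):
    # Round-robin incoming requests across the per-model queues.
    rr = itertools.cycle(per_model_queues)
    for req in requests:
        next(rr).put(req)

dispatch(range(10))
print([q.qsize() for q in per_model_queues])  # [3, 3, 2, 2]
```

Alternatively, feeding each request through a tf.placeholder with feed_dict sidesteps dataset prefetching entirely, at the cost of giving up the tf.data input pipeline.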