How can I tell whether sys.stdin.readline() is going to block?

Question

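How can I determine whether a call to sys.stdin.readline() (or, more generally, readline() on any file-descriptor-based file object) is going to block?

This comes up when I'm writing a line-based text filter program in python; that is, a program that repeatedly reads a line of text from input, maybe transforms it, and then writes it to output.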

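I'd like to implement a reasonable output buffering strategy. My criteria are:

1. It should be efficient when processing millions of lines in bulk: mostly buffer the output, with occasional flushes.
2. It should never block on input while holding buffered output.

So unbuffered output is no good because it violates (1) (too many writes to the OS). Line-buffered output is no good because it still violates (1) (it makes no sense to flush the output on each of a million lines in bulk). And default-buffered output is no good because it violates (2) (it withholds output inappropriately if the output goes to a file or pipe).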
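To make the cost behind criterion (1) concrete, here is a minimal Python 3 sketch that counts how many writes reach the underlying stream with and without a flush after every line. CountingRaw and count_raw_writes are hypothetical names introduced for illustration, and the in-memory raw stream only stands in for the OS, so real system call costs will differ.

```python
import io

class CountingRaw(io.RawIOBase):
    """Hypothetical stand-in for the OS: counts write() calls and discards the data."""
    def __init__(self):
        super().__init__()
        self.writes = 0

    def writable(self):
        return True

    def write(self, b):
        self.writes += 1
        return len(b)  # pretend everything was written

def count_raw_writes(num_lines, flush_each_line):
    """Write num_lines numbered lines and return how many raw writes happened."""
    raw = CountingRaw()
    out = io.BufferedWriter(raw, buffer_size=8192)
    for i in range(num_lines):
        out.write(b"%d: some text\n" % (i + 1))
        if flush_each_line:
            out.flush()  # line-buffered behavior: one raw write per line
    out.flush()          # push out whatever is still buffered
    return raw.writes
```

With a flush after every line, each line costs one raw write; without it, the buffered writer hands the data over in a few large chunks.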

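I think a good solution, in most cases, would be: "flush sys.stdout whenever (its buffer is full or) sys.stdin.readline() is about to block". Can that be implemented?

(Note that I'm not claiming this strategy is perfect for all cases. For example, it is probably not ideal when the program is cpu-bound; in that case it may be wise to flush more often, to avoid withholding output during long computations.)

To make this concrete, let's say I'm implementing unix's "cat -n" program in python.

(Actually "cat -n" is smarter than line-at-a-time; that is, it knows how to read and write part of a line before the full line has been read. For this example, though, I'm going to implement it line-at-a-time anyway.)

Line-buffered implementation

(well-behaved, but violates criterion (1), i.e. it is unreasonably slow because it flushes too often):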

#!/usr/bin/python
# cat-n.linebuffered.py
import sys
num_lines_read = 0
while True:
  line = sys.stdin.readline()
  if line == '': break
  num_lines_read += 1
  # line already ends with '\n', so use write() rather than print() to avoid blank lines
  sys.stdout.write("%d: %s" % (num_lines_read, line))
  sys.stdout.flush()

Default-buffered implementation

(fast, but violates criterion (2), i.e. it withholds output in an unfriendly way):

#!/usr/bin/python
# cat-n.defaultbuffered.py
import sys
num_lines_read = 0
while True:
  line = sys.stdin.readline()
  if line == '': break
  num_lines_read += 1
  # line already ends with '\n', so use write() rather than print() to avoid blank lines
  sys.stdout.write("%d: %s" % (num_lines_read, line))

Desired implementation:

#!/usr/bin/python
import sys
num_lines_read = 0
while True:
  if sys_stdin_readline_is_about_to_block():  # <--- How do I implement this??
    sys.stdout.flush()
  line = sys.stdin.readline()
  if line == '': break
  num_lines_read += 1
  # line already ends with '\n', so use write() rather than print() to avoid blank lines
  sys.stdout.write("%d: %s" % (num_lines_read, line))

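So the question is: is it possible to implement sys_stdin_readline_is_about_to_block()?

I'd like an answer that works in both python2 and python3. I've looked into each of the following techniques, but nothing has panned out so far.

- Use select([sys.stdin],[],[],0) to find out whether reading from sys.stdin would block. (This doesn't work when sys.stdin is a buffered file object, for at least one and possibly two reasons: (1) it will wrongly say "will not block" if a partial line is ready to read from the underlying input pipe, (2) it will wrongly say "will block" if sys.stdin's buffer contains a full input line but the underlying pipe is not ready for an additional read... I think.)
- Non-blocking io, using os.fdopen(sys.stdin.fileno(), 'r') together with fcntl and O_NONBLOCK. (I couldn't get this to work with readline() in any python version: in python2.7, it loses input whenever a partial line comes in; in python3, it seems to be impossible to distinguish between "would block" and end-of-input. ??)
- asyncio. (It's not clear to me how much of this is available in python2, and I don't think it works with sys.stdin. However, I'd still be interested in an answer that works only when reading from a pipe returned from subprocess.Popen().)
- Create a thread to do the readline() loop and pass each line to the main program via a queue.Queue; the main program can then poll the queue before reading each line from it, and flush stdout first whenever it sees that it's about to block. (I tried this and actually got it working, see below, but it's horribly slow, much slower than line buffering.)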
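The second failure mode of the select approach (select says "would block" although a full line is already sitting in the file object's buffer) can be reproduced with a pipe. This is a minimal sketch, assuming a POSIX system where select works on pipes; demo_select_vs_buffer is a hypothetical name used only for this illustration.

```python
import io
import os
import select

def demo_select_vs_buffer():
    """Show select() reporting 'not readable' while a buffered line is still pending."""
    r, w = os.pipe()
    os.write(w, b"first\nsecond\n")
    reader = io.open(r, "rb")      # buffered reader on top of the pipe
    first = reader.readline()      # pulls both lines into the user-space buffer
    # The pipe itself is now empty, so select() sees nothing to read...
    ready, _, _ = select.select([reader], [], [], 0)
    # ...yet this readline() returns immediately from the buffer.
    second = reader.readline()
    os.close(w)
    reader.close()
    return first, bool(ready), second
```

select only knows about the file descriptor, not about data that the file object has already read ahead, which is why it cannot answer the question for a buffered sys.stdin.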

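Threaded implementation:

Note that this doesn't strictly answer the question "how can I tell whether sys.stdin.readline() is going to block", but it does implement the desired buffering strategy anyway. It's too slow, though.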

#!/usr/bin/python
# cat-n.threaded.py
try:
  import queue           # Python 3
except ImportError:
  import Queue as queue  # Python 2 names the module Queue
import sys
import threading
def iter_with_abouttoblock_cb(callable, sentinel, abouttoblock_cb, qsize=100):
  # child will send each item through q to parent.
  q = queue.Queue(qsize)
  def child_fun():
    for item in iter(callable, sentinel):
      q.put(item)
    q.put(sentinel)
  child = threading.Thread(target=child_fun)
  # The child thread normally runs until it sees the sentinel,
  # but we mark it daemon so that it won't prevent the parent
  # from exiting prematurely if it wants.
  child.daemon = True
  child.start()
  while True:
    try:
      item = q.get(block=False)
    except queue.Empty:
      # q is empty; call abouttoblock_cb before blocking
      abouttoblock_cb()
      item = q.get(block=True)
    if item == sentinel:
      break  # do *not* yield sentinel
    yield item
  child.join()

num_lines_read = 0
for line in iter_with_abouttoblock_cb(sys.stdin.readline,
                                      sentinel='',
                                      abouttoblock_cb=sys.stdout.flush):
  num_lines_read += 1
  sys.stdout.write("%d: %s" % (num_lines_read, line))

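Verifying the buffering behavior:

The following commands (in bash on Linux) show the expected buffering behavior: "defaultbuffered" buffers too aggressively, whereas "linebuffered" and "threaded" buffer just right.

(Note that the | cat at the end of the pipeline is there to make python block-buffer instead of line-buffer by default.)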
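One way to check which mode a stream is in, as a minimal sketch assuming Python 3, where text streams expose a line_buffering attribute (Python 2 file objects do not have it). buffering_mode is a hypothetical helper name.

```python
import io
import sys

def buffering_mode(stream):
    """Best-effort label for a Python 3 text stream's buffering mode."""
    if getattr(stream, "line_buffering", False):
        return "line-buffered"
    return "block-buffered"

# Example: report the mode of the real stdout on stderr, so the report
# itself does not get stuck in stdout's buffer.
# sys.stderr.write(buffering_mode(sys.stdout) + "\n")
```

Running the commented-out line with and without a trailing | cat should show the difference between writing to a terminal and writing to a pipe.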

for which in defaultbuffered linebuffered threaded; do
  for python in python2.7 python3.5; do
    echo "$python cat-n.$which.py:"
      (echo z; echo -n a; sleep 1; echo b; sleep 1; echo -n c; sleep 1; echo d; echo x; echo y; echo z; sleep 1; echo -n e; sleep 1; echo f) | $python cat-n.$which.py | cat
  done
done

Output:

python2.7 cat-n.defaultbuffered.py:
[... pauses 5 seconds here. Bad! ...]
1: z
2: ab
3: cd
4: x
5: y
6: z
7: ef
python3.5 cat-n.defaultbuffered.py:
[same]
python2.7 cat-n.linebuffered.py:
1: z
[... pauses 1 second here, as expected ...]
2: ab
[... pauses 2 seconds here, as expected ...]
3: cd
4: x
5: y
6: z
[... pauses 2 seconds here, as expected ...]
7: ef
python3.5 cat-n.linebuffered.py:
[same]
python2.7 cat-n.threaded.py:
[same]
python3.5 cat-n.threaded.py:
[same]

Timings:

(in bash on Linux):

for which in defaultbuffered linebuffered threaded; do
  for python in python2.7 python3.5; do
    echo -n "$python cat-n.$which.py:  "
      timings=$(time (yes 01234567890123456789012345678901234567890123456789012345678901234567890123456789 | head -1000000 | $python cat-n.$which.py >| /tmp/REMOVE_ME) 2>&1)
      echo $timings
  done
done
/bin/rm /tmp/REMOVE_ME

Output:

python2.7 cat-n.defaultbuffered.py:  real 0m1.490s user 0m1.191s sys 0m0.386s
python3.5 cat-n.defaultbuffered.py:  real 0m1.633s user 0m1.007s sys 0m0.311s
python2.7 cat-n.linebuffered.py:  real 0m5.248s user 0m2.198s sys 0m2.704s
python3.5 cat-n.linebuffered.py:  real 0m6.462s user 0m3.038s sys 0m3.224s
python2.7 cat-n.threaded.py:  real 0m25.097s user 0m18.392s sys 0m16.483s
python3.5 cat-n.threaded.py:  real 0m12.655s user 0m11.722s sys 0m1.540s

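To reiterate, I'd like a solution that never blocks while holding buffered output ("linebuffered" and "threaded" are both fine in this respect) and that is also fast: that is, comparable in speed to "defaultbuffered".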

Answer 1

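You can certainly use select: this is what it's for, and it performs well for a small number of file descriptors. You have to implement the line buffering/breaking yourself, so that you can detect whether more input is available after you have buffered (what turns out to be) a partial line.

You can either do all the buffering yourself (which is reasonable, since select operates at the file descriptor level), or you can set stdin to be non-blocking and use file.read() or BufferedReader.read() (depending on your Python version) to consume whatever is available. If your input might be an Internet socket, you must use non-blocking input regardless of buffering, because common implementations of select can spuriously indicate readable data on a socket. (In that case the Python 2 version raises IOError with EAGAIN; the Python 3 version returns None.)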
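A minimal sketch of "consume whatever is available" in Python 3. It swaps in os.read on the raw descriptor instead of file.read() or BufferedReader.read(), so "would block" shows up as BlockingIOError rather than as a None return value. read_available is a hypothetical helper name, and the caller is assumed to have already put the descriptor into non-blocking mode (for example with os.set_blocking(fd, False), available from Python 3.5).

```python
import os

def read_available(fd, chunk=65536):
    """Read everything currently available from a non-blocking descriptor.

    Returns (data, eof): the bytes that could be read without blocking,
    and whether end-of-input was reached.
    """
    pieces = []
    while True:
        try:
            block = os.read(fd, chunk)
        except BlockingIOError:      # EAGAIN: nothing more right now
            return b"".join(pieces), False
        if not block:                # empty read: end of input
            return b"".join(pieces), True
        pieces.append(block)
```

Unlike readline() on a non-blocking text stream, this keeps "would block" and end-of-input clearly apart.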

(os.fdopen doesn't help here, since it doesn't create a new file descriptor for fcntl to use. On some systems, you can open /dev/stdin with O_NONBLOCK.)

A Python 2 implementation based on the default-buffered file.read():

import sys,os,select,fcntl,errno

fcntl.fcntl(sys.stdin.fileno(),fcntl.F_SETFL,os.O_NONBLOCK)

rfs=[sys.stdin.fileno()]
xfs=rfs+[sys.stdout.fileno()]

buf=""
lnum=0
timeout=None
rd=True
while rd:
  rl,_,xl=select.select(rfs,(),xfs,timeout)
  if xl: raise IOError          # "exception" occurred (TCP OOB data?)
  if rl:
    try: rd=sys.stdin.read()    # read whatever we have
    except IOError as e:        # spurious readiness?
      if e.errno!=errno.EAGAIN: raise # die on other errors
    else:
      buf+=rd
      nl0=0                     # previous newline
      while True:
        nl=buf.find('\n',nl0)
        if nl<0:
          buf=buf[nl0:]         # hold partial line for "processing"
          break
        lnum+=1
        print "%d: %s"%(lnum,buf[nl0:nl])
        timeout=0
        nl0=nl+1
  else:                         # no input yet
    sys.stdout.flush()
    timeout=None

if buf: sys.stdout.write("%d: %s"%(lnum+1,buf)) # write any partial last line

For just cat -n, we could write out partial lines as soon as we get them, but this holds them back to represent processing an entire line at once.
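For Python 3, the same idea (poll with a zero timeout, flush only when nothing is ready, then wait) might look like the sketch below. It is not a port of the code above: it does its own buffering with os.read on a blocking descriptor, which assumes the input is a pipe or file for which select readiness is reliable; for sockets, the non-blocking variant described earlier would be needed. cat_n is a hypothetical name, and the bytes % formatting requires Python 3.5 or later.

```python
import os
import select

def cat_n(in_fd, out, chunk=65536):
    """Number lines from in_fd onto the binary stream out, flushing only
    when no more input is immediately available."""
    buf = b""
    lnum = 0
    while True:
        ready, _, _ = select.select([in_fd], [], [], 0)  # poll, do not wait
        if not ready:
            out.flush()                     # about to block: release output
            select.select([in_fd], [], [])  # now wait for input
        data = os.read(in_fd, chunk)
        if not data:                        # end of input
            break
        buf += data
        *lines, buf = buf.split(b"\n")      # keep the partial last line in buf
        for line in lines:
            lnum += 1
            out.write(b"%d: %s\n" % (lnum, line))
    if buf:                                 # partial last line without newline
        out.write(b"%d: %s" % (lnum + 1, buf))
    out.flush()

# Usage as a filter: cat_n(sys.stdin.fileno(), sys.stdout.buffer)
```

Because input is read in large chunks and output is flushed only when the input runs dry, bulk throughput should stay close to the default-buffered case, though that has not been measured here.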

On my (underpowered) machine, your cat -n timing test takes "real 0m2.454s user 0m2.144s sys 0m0.504s".

Answer 2

# -*- coding: utf-8 -*-
import os
import sys
import select
import fcntl
import threading


class StdInput:
    def __init__(self):
        self.close_evt = threading.Event()

        fd = sys.stdin.fileno()
        flags = fcntl.fcntl(fd, fcntl.F_GETFL)
        fcntl.fcntl(fd, fcntl.F_SETFL, flags | os.O_NONBLOCK)  # keep existing flags, add O_NONBLOCK
        self.input = (sys.stdin.original_stdin if hasattr(sys.stdin, "original_stdin") else sys.stdin)
        self.epoll = select.epoll()
        self.epoll.register(sys.stdin.fileno(), select.EPOLLIN | select.EPOLLPRI | select.EPOLLERR | select.EPOLLHUP | select.EPOLLRDBAND)

    def read(self):
        while not self.close_evt.is_set():
            input_line = self.input.readline()
            # If the object is in non-blocking mode and no bytes are available, None is returned.
            if input_line is not None and len(input_line) > 0:
                break           
            print("Nothing yet...")
            evt_lst = self.epoll.poll(1.0)  # Timeout 1s
            print("Poll exited: event list size={}".format(len(evt_lst)))
            if len(evt_lst) > 0:
                assert len(evt_lst) == 1
                if (evt_lst[0][1] & (select.EPOLLERR | select.EPOLLHUP)) > 0:
                    raise Exception("Ooops!!!")
        return input_line


if __name__ == "__main__":
    i = StdInput()

    def alm_handle():
        i.close_evt.set()
    threading.Timer(4, alm_handle).start()

    print("Reading...")
    input_line = i.read()
    print("Read='{}'".format(input_line))
