Question

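I'm using tornado.httpclient.AsyncHTTPClient.fetch to fetch domains from a list. When I fetch domains at a fairly large interval (for example 500), everything works fine, but when I reduce the interval to 100, the following exception occurs from time to time: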


Traceback (most recent call last):
  File "/home/crchemist/python-2.7.2/lib/python2.7/site-packages/tornado/simple_httpclient.py", line 289, in cleanup
    yield
  File "/home/crchemist/python-2.7.2/lib/python2.7/site-packages/tornado/stack_context.py", line 183, in wrapped
    callback(*args, **kwargs)
  File "/home/crchemist/python-2.7.2/lib/python2.7/site-packages/tornado/simple_httpclient.py", line 384, in _on_chunk_length
    self._on_chunk_data)
  File "/home/crchemist/python-2.7.2/lib/python2.7/site-packages/tornado/iostream.py", line 180, in read_bytes
    self._check_closed()
  File "/home/crchemist/python-2.7.2/lib/python2.7/site-packages/tornado/iostream.py", line 504, in _check_closed
    raise IOError("Stream is closed")
IOError: Stream is closed

What is the reason for this behavior? The code looks like this:


from tornado import ioloop
from tornado.httpclient import AsyncHTTPClient, HTTPRequest


def fetch_domain(domain):
    http_client = AsyncHTTPClient()
    request = HTTPRequest('http://' + domain,
       user_agent=CRAWLER_USER_AGENT)
    http_client.fetch(request, handle_domain)


class DomainFetcher(object):
    def __init__(self, domains_iterator):
        self.domains = domains_iterator

    def __call__(self):
        try:
            domain = next(self.domains)
        except StopIteration:
            domain_generator.stop()
            ioloop.IOLoop.instance().stop()
        else:
            fetch_domain(domain)

domain_generator = ioloop.PeriodicCallback(DomainFetcher(domains), 500)
domain_generator.start()

Answer 1

Note that tornado.ioloop.PeriodicCallback takes a cycle time in integer ms, while HTTPRequest objects are configured with connect_timeout and/or request_timeout in float seconds (see doc).

"Users browsing the Internet feel that responses are 'instant' when delays are less than 100 ms from click to response" (from wikipedia). See this ServerFault question for normal latency values.

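The IOError: Stream is closed is validly raised to inform you that your connection timed out without completing, or more accurately, that you manually called the callback on a pipe that was not yet open. This is expected, since latency above 100 ms is not abnormal; if you want your fetches to complete reliably, you should raise this value.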

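Once your timeout is set to something sane, consider wrapping your fetches in a try/except retry loop, since this is a normal exception that you can expect to occur in production. Just be careful to set a retry limit!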
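A minimal sketch of a bounded retry, assuming the callback form of fetch used in the question, where a failure is reported through response.error rather than raised at the call site. The names fetch_with_retry, on_done and max_retries are hypothetical, and fetch stands for any function with the signature fetch(url, callback), so this is an illustration rather than Tornado's own retry mechanism.

```python
def fetch_with_retry(fetch, url, on_done, max_retries=3):
    """Call fetch(url, callback) and retry on error, at most max_retries times.

    `fetch` is assumed to behave like the callback form of
    AsyncHTTPClient.fetch: it eventually invokes callback(response),
    where response.error is None on success.
    """
    # A dict is used so the nested function can update the counter on Python 2.
    attempts = {'left': max_retries}

    def handle(response):
        if response.error and attempts['left'] > 0:
            attempts['left'] -= 1
            fetch(url, handle)   # try again with the same handler
        else:
            on_done(response)    # success, or the retry limit was reached

    fetch(url, handle)
```

With max_retries=3 a URL is attempted at most four times, and on_done always receives the last response, so the caller can still inspect response.error when every attempt failed.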

Since you're using an async framework, why not let it handle the async callback itself instead of running said callback at a fixed interval? Epoll/kqueue are efficient and supported by this framework.

from tornado import httpclient, ioloop

def handle_request(response):
    if response.error:
        print "Error:", response.error
    else:
        print response.body
    ioloop.IOLoop.instance().stop()

http_client = httpclient.AsyncHTTPClient()
http_client.fetch("http://www.google.com/", handle_request)
ioloop.IOLoop.instance().start()

^ Copied from the doc.

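If you go this route, the only catch is coding your request queue so that a maximum number of open connections is enforced. Otherwise you are likely to run into race conditions when doing serious scraping.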
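One way such a queue could look, as a rough sketch: requests beyond the limit wait in a deque and are started as earlier ones finish. BoundedFetcher, max_open and submit are hypothetical names, and fetch is assumed to be any function with the signature fetch(url, callback), such as the callback form of AsyncHTTPClient.fetch used in the question.

```python
from collections import deque


class BoundedFetcher(object):
    """Keep at most max_open requests in flight; queue the rest."""

    def __init__(self, fetch, max_open=10):
        self.fetch = fetch        # assumed signature: fetch(url, callback)
        self.max_open = max_open
        self.pending = deque()    # (url, callback) pairs not yet started
        self.open = 0             # number of requests currently in flight

    def submit(self, url, callback):
        self.pending.append((url, callback))
        self._fill()

    def _fill(self):
        # Start queued requests until the connection limit is reached.
        while self.pending and self.open < self.max_open:
            url, callback = self.pending.popleft()
            self.open += 1
            self.fetch(url, self._wrap(callback))

    def _wrap(self, callback):
        def done(response):
            self.open -= 1        # a slot is free again
            try:
                callback(response)
            finally:
                self._fill()      # start the next queued request, if any
        return done
```

Because a request is only handed to fetch once a slot is free, time spent waiting in pending is not spent inside the HTTP client itself.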

It has been a year since I last touched Tornado myself, so if there are inaccuracies in this response, please let me know and I will revise it.

Answer 2

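It looks like you are writing something like a web crawler. Your problem is directly caused by timeouts, but at a deeper level it is related to the parallelism model in Tornado.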

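Of course, AsyncHTTPClient in Tornado can queue requests automatically. In fact, AsyncHTTPClient sends requests in batches of 10 (by default) and blocks waiting for their results before sending the next batch. Requests within a batch are non-blocking and processed in parallel, but it blocks between batches. Also, the callback for each request is not called immediately after that request completes; instead, the 10 callbacks are called once the whole batch has finished.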

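Back to your problem: you don't need ioloop.PeriodicCallback to issue requests gradually, because AsyncHTTPClient in Tornado can queue requests automatically. You can submit all the requests at once and let AsyncHTTPClient schedule them.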

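But the catch is that requests waiting in the queue still consume their timeout, because requests block between batches. Later requests simply block there and are sent batch by batch, rather than being put in a special ready queue and sent as soon as a response arrives.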

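So the default timeout of 20 s is too short if many requests are scheduled. If you are just doing a demo, you can set the timeout to float('inf') directly. If you are doing something serious, you have to use a try/except retry loop.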
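To make the arithmetic concrete, here is a rough estimate under the batch model described above. The function name and the seconds_per_batch parameter are hypothetical, and the per-batch time is an assumption you would have to measure on your own network.

```python
def estimated_queue_wait(position, max_clients=10, seconds_per_batch=1.0):
    """Rough wait before the request at `position` (0-based) is sent.

    Assumes requests go out in batches of max_clients and that each
    batch takes about seconds_per_batch to complete.
    """
    batches_ahead = position // max_clients
    return batches_ahead * seconds_per_batch
```

For example, if about 2800 pages take about 2 minutes at 10 requests per batch (the figures reported later in this answer), a batch takes roughly 120.0 / 280 seconds, and estimated_queue_wait(2799, 10, 120.0 / 280) comes out at close to two minutes, far beyond a 20 s timeout.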

You can find how to set the timeouts in tornado/httpclient.py, quoted here.

connect_timeout: Timeout for initial connection in seconds
request_timeout: Timeout for entire request in seconds

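Finally, I wrote a simple program that uses AsyncHTTPClient to fetch thousands of pages from the ZJU Online Judge system. You can try it and then adapt it to your crawler. On my network it fetches 2800 pages in 2 minutes. That is a very good result, 10 times faster than fetching serially (which exactly matches the parallelism size).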

#!/usr/bin/env python
from tornado.httpclient import AsyncHTTPClient, HTTPRequest
from tornado.ioloop import IOLoop

baseUrl = 'http://acm.zju.edu.cn/onlinejudge/showProblem.do?problemCode='

start = 1001
end = 3800
count = end - start
done = 0

client = AsyncHTTPClient()

def onResponse(response):
    # Count every finished request, failed or not, so the loop always stops.
    global done
    done += 1
    if response.error:
        print('Error: %s' % response.error)
    else:
        # Uncomment the next line to see something interesting: len(client.queue) drops 10 at a time.
        #print('Queue length %s, Active client count %s, Max active clients limit %s' % (len(client.queue), len(client.active), client.max_clients))
        print('Received %s, Content length %s, Done %s' % (response.effective_url[-4:], len(response.body), done))
    if done == count:
        IOLoop.instance().stop()

for i in range (start, end):
    request = HTTPRequest(baseUrl + str(i), connect_timeout=float('inf'), request_timeout=float('inf'))
    client.fetch(request, onResponse)
    print('Generated %s' % i)

IOLoop.instance().start()

Addendum:

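If you have plenty of pages to fetch and you are after the best performance, you could take a look at Twisted. I wrote the same program with Twisted and pasted it on my Gist. The result is great: 2800 pages fetched in 40 seconds.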
