How to raise twisted.internet.error.TimeoutError directly without making a request

Asked: 2017-05-22 11:54:38

Tags: python python-2.7 scrapy timeout

I have a database with thousands of URLs that I crawl with a spider. For example, 100 URLs can share the same domain:

http://notsame.com/1
http://notsame2.com/1
http://dom.com/1
http://dom.com/2
http://dom.com/3
...

The problem is that sometimes a page/domain returns nothing at all, so I get <twisted.python.failure.Failure twisted.internet.error.TimeoutError: User timeout caused connection failure:. It is the same for every URL of that domain.

I want to detect timeouts, e.g. for 5 URLs of the same domain, and then, once I am sure this host is broken, avoid requesting that domain again and raise <twisted.python.failure.Failure twisted.internet.error.TimeoutError: User timeout caused connection failure: directly.

Is this possible? If so, how?

Edit

My idea (edited with rrschmidt's help):

import twisted.internet.error

class TimeoutProcessMiddleware:
    _timeouted_domains = set()

    def process_request(self, request, spider):
        domain = get_domain(request.url)
        if domain in self._timeouted_domains:
            raise twisted.internet.error.TimeoutError
        return None  # let the request proceed normally

    def process_exception(self, request, exception, spider):
        # left out the code for counting timeouts for clarity
        if is_timeout_exception(exception):
            self._timeouted_domains.add(get_domain(request.url))
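The two helpers used above, get_domain and is_timeout_exception, are not defined in the snippet; a minimal sketch (their names come from the snippet, but these implementations are my assumption) could be:

```python
from urllib.parse import urlparse  # Python 2.7: from urlparse import urlparse


def get_domain(url):
    """Extract the host part of a URL, e.g. 'http://dom.com/1' -> 'dom.com'."""
    return urlparse(url).netloc


def is_timeout_exception(exception):
    """Check whether an exception is one of Twisted's timeout errors.

    Matching on the class name keeps this sketch free of a hard Twisted
    dependency; in a real project you would isinstance-check against
    twisted.internet.error.TimeoutError and TCPTimedOutError instead.
    """
    return type(exception).__name__ in ('TimeoutError', 'TCPTimedOutError')
```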

1 Answer:

Answer 0 (score: 0)

You are on the right track with that idea. More specifically, I would build it as a downloader middleware.

A downloader middleware gets to touch every outgoing request as well as every incoming response... and... it can also handle every exception that pops up while a request/response is being processed. Details: https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
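For a middleware like this to take effect, it also has to be registered in the project's settings.py; the module path myproject.middlewares below is a placeholder, not something from the original post:

```python
# settings.py -- 'myproject.middlewares' is a placeholder module path;
# the number is the middleware's priority in Scrapy's ordering.
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.TimeoutProcessMiddleware': 543,
}
```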

So here is what I would do (as an untested gist; it may need some fine-tuning):

TimeoutProcessMiddleware
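The answer's code snippet is cut off in this copy. Based on the idea sketched in the question, an untested fuller version that also counts timeouts per domain might look like the following (the threshold of 5 comes from the question; the counting logic is my assumption, and the built-in TimeoutError is raised here only to keep the sketch dependency-free; a real project would raise twisted.internet.error.TimeoutError):

```python
from collections import defaultdict
from urllib.parse import urlparse  # Python 2.7: from urlparse import urlparse


class TimeoutProcessMiddleware(object):
    """Fail fast for domains that have already timed out too often."""

    TIMEOUT_LIMIT = 5  # threshold taken from the question

    def __init__(self):
        self._timeout_counts = defaultdict(int)
        self._timeouted_domains = set()

    @staticmethod
    def _get_domain(url):
        return urlparse(url).netloc

    def process_request(self, request, spider):
        # Raise immediately for domains already known to time out,
        # so no network request is made at all.
        if self._get_domain(request.url) in self._timeouted_domains:
            raise TimeoutError('domain is known to time out')
        return None  # let the request proceed normally

    def process_exception(self, request, exception, spider):
        # Count timeouts per domain; after TIMEOUT_LIMIT of them,
        # blacklist the domain for the rest of the crawl.
        if type(exception).__name__ in ('TimeoutError', 'TCPTimedOutError'):
            domain = self._get_domain(request.url)
            self._timeout_counts[domain] += 1
            if self._timeout_counts[domain] >= self.TIMEOUT_LIMIT:
                self._timeouted_domains.add(domain)
```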