Move on to the next URL after a certain number of pages have been crawled in Scrapy

Asked: 2017-11-01 16:56:34

Tags: python scrapy

I have already found two answers to this question, but neither of them works for me. Basically, I want to limit the number of pages crawled per domain. Here is the code in the actual spider:

def parse_page(self, response):
    # record the host part of every visited URL
    visited_count.append(response.url.split('/')[2])
    if visited_count.count(response.url.split('/')[2]) > 49:
        print '!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!'
        denied.append(response.url)

And the custom middleware:

class IgnoreDomain(object):
    def process_requests(request, spider):
        if request in spider.denied:
            return IgnoreRequest()
        else:
            return None  

The middleware is, of course, enabled in the settings. I would really appreciate it if you could point out what I'm doing wrong.
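For reference, wiring a custom downloader middleware into settings.py usually looks like the snippet below; the module path myproject.middlewares.IgnoreDomain is an assumption, since the question does not show it:

# settings.py -- the module path is an assumption; use your project's actual one
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.IgnoreDomain': 543,
}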

1 Answer:

Answer 0 (score: 1)

You said "I want to restrict the amount of pages crawled per domain"...

To do this, create a counter in your spider:

import scrapy

class YourSpider(scrapy.Spider):
    counter = {}
    # counter will hold values like {'google': 4, 'website': 2},
    # keyed by the registered domain that the middleware extracts below
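The middleware below keys this counter with tldextract, which splits a URL into subdomain, domain, and suffix; indexing the result with [1] yields the registered domain alone. A quick illustration, assuming the tldextract package is installed:

import tldextract

# the result behaves like a namedtuple: (subdomain, domain, suffix)
ext = tldextract.extract('http://forums.news.cnn.com/page/1')
# ext == ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')
# ext[1] == 'cnn' -- this is the key stored in spider.counter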

Then write this at the top of your middleware file:

from scrapy.exceptions import IgnoreRequest
import tldextract
import logging

class YourMiddleware(object):

    def process_request(self, request, spider):
        # the registered domain, e.g. 'cnn' for 'http://forums.news.cnn.com/...'
        domain = tldextract.extract(request.url)[1]
        logging.info(spider.counter)
        if domain in spider.counter and spider.counter[domain] > 5:
            # this domain has reached its page limit: drop the request
            raise IgnoreRequest()
        # returning None (implicitly) lets Scrapy keep processing the request

    def process_response(self, request, response, spider):
        # count one fetched page against this request's domain
        domain = tldextract.extract(request.url)[1]
        spider.counter[domain] = spider.counter.get(domain, 0) + 1
        return response
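As a side note (not part of the original answer), the hard-coded limit of 5 could instead be read from the crawler settings via Scrapy's standard from_crawler hook. A minimal sketch, assuming the spider still defines counter = {} as above; the setting name MAX_PAGES_PER_DOMAIN and the class name DomainLimitMiddleware are made up for illustration:

from scrapy.exceptions import IgnoreRequest
import tldextract

class DomainLimitMiddleware(object):

    def __init__(self, max_pages):
        self.max_pages = max_pages

    @classmethod
    def from_crawler(cls, crawler):
        # MAX_PAGES_PER_DOMAIN is a hypothetical custom setting;
        # default to 5, the value hard-coded in the answer above
        return cls(crawler.settings.getint('MAX_PAGES_PER_DOMAIN', 5))

    def process_request(self, request, spider):
        domain = tldextract.extract(request.url)[1]
        if spider.counter.get(domain, 0) > self.max_pages:
            raise IgnoreRequest()

    def process_response(self, request, response, spider):
        domain = tldextract.extract(request.url)[1]
        spider.counter[domain] = spider.counter.get(domain, 0) + 1
        return response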