Question

In Scrapy, how can I use different callbacks for allowed domains and denied domains?

I am using the following rules:

rules = [Rule(LinkExtractor(allow=(), deny_domains = allowed_domains), callback='parse_denied_item', follow=True),
Rule(LinkExtractor(allow_domains = allowed_domains), callback='parse_item', follow=True)]

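Basically, I want parse_item to be called whenever a request is made to one of the allowed_domains (or a subdomain of one of those domains). Then I want parse_denied_item to be called for all requests to domains that are not whitelisted in allowed_domains.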

How can I do this?

Answer 1

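I believe the best approach is not to use allowed_domains on LinkExtractor, and instead to parse the domain out of response.url in your parse_* method and run different logic depending on the domain.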

You can keep separate parse_* methods plus a triaging method that, depending on the domain, calls the corresponding parse_* method with yield from self.parse_*(response) (Python 3):

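from urllib.parse import urlparse

rules = [Rule(LinkExtractor(), callback='parse_all', follow=True)]

def parse_all(self, response):
    # Get the domain out of response.url (exact host match; subdomains are not handled here)
    domain = urlparse(response.url).hostname
    if domain in allowed_domains:
        yield from self.parse_item(response)
    else:
        yield from self.parse_denied_item(response)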
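
The domain test in parse_all can also be made subdomain-aware, which is what the question asks for. A minimal sketch using only the standard library follows; the helper name is_allowed_url is hypothetical, and treating "host equals the domain or ends with a dot plus the domain" as the definition of "allowed" is an assumption:

```python
from urllib.parse import urlparse


def is_allowed_url(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    # hostname is None for URLs without a network location, so fall back to ""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in (d.lower() for d in allowed_domains)
    )
```

Inside parse_all, the condition could then read `if is_allowed_url(response.url, allowed_domains):` instead of the exact-match test.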

Answer 2

Building on Gallaecio's answer, another option is to use the process_request argument of Rule. process_request intercepts the request before it is sent.

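As I understand it (I may be wrong), Scrapy will only crawl domains listed in self.allowed_domains (assuming it is set). However, if an offsite link is encountered on a scraped page, Scrapy will in some cases send a single request to that offsite link [1]. I am not sure why this happens. I think it may be because the target site performs a 301 or 302 redirect and the crawler automatically follows that URL. Otherwise, it could be a bug.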

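process_request can be used to process a request before it is executed. In my case, I wanted to log all links that are not crawled. So before proceeding, I verify that an allowed domain appears in request.url, and log any domain that does not match.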

Here is an example:

from urllib.parse import urlparse

rules = [Rule(LinkExtractor(), callback='parse_item', process_request='process_item', follow=True)]

# Scrapy 2.0+ also passes the originating response as a second argument;
# the default value keeps the one-argument form working as well
def process_item(self, request, response=None):
    # proceed only if an allowed domain appears in request.url (substring match)
    found = any(domain in request.url for domain in self.allowed_domains)

    if not found:  # otherwise log it
        self.logDeniedDomain(urlparse(request.url).netloc)

        # according to: https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Rule
        # returning None should prevent this request from being executed (which is not the case for all)
        # middleware is used to catch these few requests
        request = None

    return request
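
The logDeniedDomain helper is not shown in the answer. A minimal sketch of what such a helper might look like, assuming it only needs to record each denied host once: the class name DeniedDomainLog and the denied_domains attribute are hypothetical, and in a real spider the method and the set would live on the spider class itself.

```python
import logging

logger = logging.getLogger(__name__)


class DeniedDomainLog:
    """Hypothetical container that records each denied host once."""

    def __init__(self):
        self.denied_domains = set()

    def logDeniedDomain(self, netloc):
        # Log only the first time a host is seen, to keep the log readable
        if netloc not in self.denied_domains:
            self.denied_domains.add(netloc)
            logger.info("Denied offsite domain: %s", netloc)
```

Keeping the hosts in a set also makes it easy to dump the full list of uncrawled domains when the spider closes.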

[1]: If you run into this problem, you can work around it by using process_request in a downloader middleware.

My downloader middleware:

def process_request(self, request, spider):
    # catch any requests that should have been filtered, and ignore them
    allowed = any(domain in request.url for domain in spider.allowed_domains)

    if not allowed:
        print("[ignored] " + request.url)
        raise IgnoreRequest('Offsite link, ignore')

    # an allowed domain is in request.url, so let the request proceed
    return None

Make sure to import IgnoreRequest as well:

from scrapy.exceptions import IgnoreRequest

And enable the downloader middleware in settings.py.
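
As a rough illustration, enabling a downloader middleware goes through the DOWNLOADER_MIDDLEWARES setting. The module path myproject.middlewares and the class name OffsiteFilterMiddleware below are hypothetical placeholders for wherever the process_request method above lives, and the priority 543 is only an example value:

```python
# settings.py
DOWNLOADER_MIDDLEWARES = {
    # hypothetical dotted path to the middleware class: priority (example value)
    "myproject.middlewares.OffsiteFilterMiddleware": 543,
}
```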

To verify this, you can add some verification code to your spider's process_item to make sure that no requests are made to out-of-scope sites.

Scrapy rules: a callback for allowed_domains and a different callback for denied domains
