Scrapy: CrawlSpider ignores rules set in __init__

Date: 2016-09-16 12:57:47

Tags: python scrapy scrapy-spider

I have been stuck on this for days, and it is driving me crazy.

I invoke my Scrapy spider like this:

scrapy crawl example -a follow_links="True"

I pass in the follow_links flag to decide whether the entire website should be scraped, or only the index pages I define in the spider.
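(As a side note, -a name=value options are forwarded to the spider's constructor as keyword arguments, and the values always arrive as strings; that is why the code below compares against the string "True". Roughly speaking, the command above amounts to the following direct instantiation, shown for illustration only:)

# illustrative sketch; in practice Scrapy instantiates the spider
# through its Crawler machinery, not by a direct call like this
spider = ExampleSpider(follow_links="True")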

This flag is checked in the spider's constructor to decide which rules should be set:

def __init__(self, *args, **kwargs):
    super(ExampleSpider, self).__init__(*args, **kwargs)

    self.follow_links = kwargs.get('follow_links')
    if self.follow_links == "True":
        self.rules = (
            Rule(LinkExtractor(allow=()), callback="parse_pages", follow=True),
        )
    else:
        self.rules = (
            Rule(LinkExtractor(deny=(r'[a-zA-Z0-9]*')), callback="parse_pages", follow=False),
        )

If it is "True", all links are allowed; if it is "False", all links are denied.
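(As a quick sanity check on those two extractors, here is a standalone sketch against a trivial response; it assumes only an installed Scrapy:)

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

html = b'<html><body><a href="http://www.example.com/page">page</a></body></html>'
response = HtmlResponse(url="http://www.example.com", body=html, encoding="utf-8")

# allow=() imposes no restriction, so every link is extracted
print(LinkExtractor(allow=()).extract_links(response))                 # one link
# deny=(r'[a-zA-Z0-9]*') matches every URL (a zero-length match always
# succeeds), so every link is rejected
print(LinkExtractor(deny=(r'[a-zA-Z0-9]*')).extract_links(response))   # []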

So far so good, but the rules are ignored. The only way I can get the rules to be honored is to define them outside the constructor. That means something like this works fine:

class ExampleSpider(CrawlSpider):

    rules = (
        Rule(LinkExtractor(deny=(r'[a-zA-Z0-9]*')), callback="parse_pages", follow=False),
    )

    def __init__(self, *args, **kwargs):
        ...

So basically, defining the rules inside the __init__ constructor causes them to be ignored, while defining the rules outside the constructor works as expected.

I cannot make sense of this. My code is below.

import re
import scrapy

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from w3lib.html import remove_tags, remove_comments, replace_escape_chars, replace_entities, remove_tags_with_content


class ExampleSpider(CrawlSpider):

    name = "example"
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']    
    # if the rule below is uncommented, it works as expected (i.e. follow links and call parse_pages)
    # rules = (
    #     Rule(LinkExtractor(allow=()), callback="parse_pages", follow=True),
    # )

    def __init__(self, *args, **kwargs):

        super(ExampleSpider, self).__init__(*args, **kwargs)

        # single page or follow links
        self.follow_links = kwargs.get('follow_links')
        if self.follow_links == "True":
            # the rule below will always be ignored (why?!)
            self.rules = (
                Rule(LinkExtractor(allow=()), callback="parse_pages", follow=True),
            )
        else:
            # the rule below will always be ignored (why?!)
            self.rules = (
                Rule(LinkExtractor(deny=(r'[a-zA-Z0-9]*')), callback="parse_pages", follow=False),
            )


    def parse_pages(self, response):
        print("In parse_pages")
        print(response.xpath('/html/body').extract())
        return None


    def parse_start_url(self, response):
        print("In parse_start_url")
        print(response.xpath('/html/body').extract())
        return None

Thank you for taking the time to help me with this.

1 Answer:

Answer 0 (Score: 5)

The problem here is that the CrawlSpider constructor (__init__) also processes the rules parameter, so if you need to assign them, you have to do it before calling the default constructor.
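You can see why from what CrawlSpider does at construction time. The sketch below is simplified from Scrapy's source around that era; exact details may vary by version, but the ordering is the point:

class CrawlSpider(Spider):

    rules = ()

    def __init__(self, *a, **kw):
        super(CrawlSpider, self).__init__(*a, **kw)
        # self.rules is compiled into self._rules here, once;
        # the crawling machinery only ever reads self._rules,
        # so assigning self.rules after this point has no effect
        self._compile_rules()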

In other words, do everything you need to do before calling super(ExampleSpider, self).__init__(*args, **kwargs):

def __init__(self, *args, **kwargs):
    # setting my own rules goes here, before the super() call
    super(ExampleSpider, self).__init__(*args, **kwargs)
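
Applied to the spider from the question, that means moving the super() call to the end of __init__; the rules themselves stay exactly as before, only the order changes:

def __init__(self, *args, **kwargs):
    # decide the rules first...
    self.follow_links = kwargs.get('follow_links')
    if self.follow_links == "True":
        self.rules = (
            Rule(LinkExtractor(allow=()), callback="parse_pages", follow=True),
        )
    else:
        self.rules = (
            Rule(LinkExtractor(deny=(r'[a-zA-Z0-9]*')), callback="parse_pages", follow=False),
        )
    # ...then let CrawlSpider.__init__ compile them
    super(ExampleSpider, self).__init__(*args, **kwargs)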