Question

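I'm a beginner trying to build a simple web crawler. Unfortunately, when I set allowed_domains, Scrapy filters out all subpages, apparently because the site uses relative paths. How can I fix this?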

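from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ExampleSpider(CrawlSpider):
    name = "example_crawler"
    allowed_domains = ["www.example.com"]
    start_urls = ["https://www.example.com"]

    # Follow every extracted link and pass each response to parse_text.
    rules = (
        Rule(LinkExtractor(),
             callback="parse_text",
             follow=True),)

    def parse_text(self, response):
        pass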

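If I remove allowed_domains, all subpages are crawled. With allowed_domains set, however, all subpages are filtered out, which I assume is due to the relative paths. Can this be fixed?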

Answer 1

allowed_domains should not contain www. or similar prefixes.

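If you look at OffsiteMiddleware, it compiles all the values in allowed_domains into a single regular expression and then matches the host of every page you try to crawl against that regex: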

    regex = r'^(.*\.)?(%s)$' % '|'.join(re.escape(d) for d in allowed_domains if d is not None)
    return re.compile(regex)

The regex allows subdomains, so you can easily have allowed_domains=['example.com', 'foo.example.com']. If you leave the www. in, Scrapy treats it as a subdomain, so the match fails for URLs that don't have it.
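
To see the effect outside of Scrapy, here is a minimal sketch that reuses the pattern quoted above with plain `re`. The helper name `build_host_regex` and the sample host list are assumptions made for illustration; this is not Scrapy's actual middleware code, only the same regex applied to a few hostnames.

```python
import re


def build_host_regex(allowed_domains):
    # Hypothetical helper: builds the same pattern as the snippet quoted above.
    regex = r'^(.*\.)?(%s)$' % '|'.join(
        re.escape(d) for d in allowed_domains if d is not None
    )
    return re.compile(regex)


# Illustrative hostnames, not taken from a real crawl.
hosts = ["www.example.com", "example.com", "foo.example.com", "other.org"]

with_www = build_host_regex(["www.example.com"])
bare = build_host_regex(["example.com"])

# Hosts accepted when allowed_domains keeps the www. prefix.
with_www_hits = [h for h in hosts if with_www.search(h)]
# Hosts accepted when allowed_domains holds only the bare domain.
bare_hits = [h for h in hosts if bare.search(h)]

print(with_www_hits)
print(bare_hits)
```

With the www. prefix only the exact host www.example.com passes, while the bare domain also accepts example.com and any subdomain. In the spider from the question, that corresponds to setting allowed_domains = ["example.com"].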

How to use allowed_domains with relative paths?
