Question

I have the following rules:

    Rule(SgmlLinkExtractor(allow=r'.*?', deny=r'/preferences')),
    Rule(SgmlLinkExtractor(allow=r'.*?', deny=r'/auth'), follow=True),

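But I still see the following in the log. I also tried it without allow= and the result is the same. Do I need to filter out these URLs in a middleware instead?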

    2014-01-08 21:31:07+0100 [mybot] DEBUG: Crawled (200) <GET http://mydomain.com/preferences/language?continue_to=xxxxx> (referer: http://mydomain.com/categories/something-something-something)

Answer 1

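From the Scrapy docs:

If multiple rules match the same link, the first one will be used, according to the order they're defined in this attribute.

Each rule's extractor only denies its own pattern, so any URL rejected by one rule is still matched by the other. Just merge all the deny patterns into a single rule: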

    Rule(SgmlLinkExtractor(deny=(r'\/preferences', r'\/auth')))
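To see why the two separate rules let the URL through, here is a minimal sketch in plain Python. It only imitates, in simplified form, how each extractor applies its own allow/deny patterns independently. `extractor_accepts` and `extracted_by_any` are hypothetical helpers written for this illustration, not Scrapy APIs.

```python
import re


def extractor_accepts(url, allow=(), deny=()):
    """Simplified stand-in for one link extractor's allow/deny check (hypothetical)."""
    # An empty allow list means "allow everything".
    if allow and not any(re.search(p, url) for p in allow):
        return False
    # The URL is rejected if any deny pattern is found anywhere in it.
    return not any(re.search(p, url) for p in deny)


def extracted_by_any(url, extractors):
    """The link gets followed if at least one rule's extractor accepts it."""
    return any(extractor_accepts(url, **kwargs) for kwargs in extractors)


url = "http://mydomain.com/preferences/language?continue_to=xxxxx"

# Two rules, each denying only its own pattern (the setup from the question).
separate = [{"deny": (r"/preferences",)}, {"deny": (r"/auth",)}]
# One rule carrying both deny patterns (the fix).
merged = [{"deny": (r"/preferences", r"/auth")}]

print(extracted_by_any(url, separate))  # True: the /auth rule still accepts it
print(extracted_by_any(url, merged))    # False: no rule accepts it
```

With separate rules, the second extractor knows nothing about `/preferences`, so it extracts the link that the first one rejected.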

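A few notes:

Deny rules are regular expression patterns. Escaping the slash as \/ is harmless, although Python's re module does not require it.
Matching is done against the absolute URL, so I would use http:\/\/mydomain.com\/preferences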

follow = True is already the default when a Rule has no callback, so it can be omitted.
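The points about patterns can be checked with Python's `re` module alone. This is a small sketch that assumes the patterns are applied with an unanchored search against the absolute URL, which as far as I know is how Scrapy's link extractors use them. The variable names are illustrative only.

```python
import re

url = "http://mydomain.com/preferences/language?continue_to=xxxxx"

# Escaped and unescaped slashes compile to equivalent patterns in Python.
escaped = re.compile(r"http:\/\/mydomain\.com\/preferences")
plain = re.compile(r"http://mydomain\.com/preferences")

# A short, unanchored pattern also matches, because search() scans the whole URL.
short = re.compile(r"/preferences")

# An anchored path-only pattern does not match: the URL starts with the scheme.
anchored = re.compile(r"^/preferences")

print(bool(escaped.search(url)))   # True
print(bool(plain.search(url)))     # True
print(bool(short.search(url)))     # True
print(bool(anchored.search(url)))  # False
```

Spelling out the host makes the pattern stricter, which helps if `/preferences` could also appear in the path or query string of pages you do want to crawl.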

Scrapy is skipping my deny rules