How do I make Scrapy ignore or skip pages/links?

Asked: 2014-01-04 05:14:37

Tags: python scrapy

I tried the following, but these pages are still being crawled:

rules = (
    Rule(SgmlLinkExtractor(deny=r'/preferences'), follow=False),
    Rule(SgmlLinkExtractor(deny=r'/auth'), follow=False),
)

What am I doing wrong?
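One likely cause: each `SgmlLinkExtractor` applies only its own `deny` list, so a link denied by the first rule can still be extracted by the second rule (which only denies `/auth`), and vice versa. The `deny` argument takes regexes that are searched against each extracted URL, so both patterns can be combined into a single extractor. The matching behavior can be checked without Scrapy at all (the URLs below are hypothetical examples):

```python
import re

# Both deny patterns in one list, as they would appear in a single
# SgmlLinkExtractor(deny=[...]) rather than split across two rules.
deny_patterns = [r'/preferences', r'/auth']

urls = [
    'http://example.com/preferences/edit',
    'http://example.com/auth/login',
    'http://example.com/products/42',
]

# A URL is kept only if no deny pattern matches anywhere in it,
# mirroring how the deny regexes are searched against link URLs.
allowed = [u for u in urls if not any(re.search(p, u) for p in deny_patterns)]
print(allowed)  # only the /products URL survives
```

This is a sketch of the regex matching only; in the spider itself the equivalent would be a single `Rule(SgmlLinkExtractor(deny=[r'/preferences', r'/auth']))`.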

I also tried this downloader middleware:

from scrapy.exceptions import IgnoreRequest

class URLFilterMiddleware(object):
    def process_request(self, request, spider):
        skip_urls = ['/auth', '/preferences']
        for bad_url in skip_urls:
            if bad_url in request.url:
                # IgnoreRequest must be raised, not returned
                raise IgnoreRequest()
        # returning None lets the request continue through the middleware chain
        return None

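The two bugs in the original middleware were returning `IgnoreRequest()` instead of raising it (a returned exception instance is ignored by Scrapy) and returning the request from inside the loop, which short-circuits after the first pattern. The intended logic can be exercised standalone, with hypothetical stubs standing in for Scrapy's `Request` and `IgnoreRequest`:

```python
class IgnoreRequest(Exception):
    """Stub for scrapy.exceptions.IgnoreRequest."""

class Request:
    """Minimal stand-in for a Scrapy request: just carries a URL."""
    def __init__(self, url):
        self.url = url

def process_request(request, spider=None):
    skip_urls = ['/auth', '/preferences']
    for bad_url in skip_urls:
        if bad_url in request.url:
            # The exception must be *raised*; returning it has no effect.
            raise IgnoreRequest()
    # Returning None lets the request continue down the middleware chain.
    return None

print(process_request(Request('http://example.com/items')))  # None: allowed
try:
    process_request(Request('http://example.com/auth/login'))
except IgnoreRequest:
    print('dropped')
```

With real Scrapy, the middleware must also be enabled in `DOWNLOADER_MIDDLEWARES` in settings, or `process_request` is never called.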
0 Answers:

No answers yet.