I have Scrapy crawling my site, finding links that return a 404 response, and writing them out to a JSON file. That part works very well.
However, I can't figure out how to get every instance of a broken link, because the duplicate filter is catching those links instead of retrying them.
Since our site has thousands of pages, with sections managed by several different teams, I need to be able to produce a broken-link report per section, rather than finding one instance and doing a site-wide search and replace.
Any help is greatly appreciated.
My current spider:
import scrapy
from datetime import datetime

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field


# Add Items for exporting to JSON
class DevelopersLinkItem(Item):
    url = Field()
    referer = Field()
    link_text = Field()
    status = Field()
    time = Field()


class DevelopersSpider(CrawlSpider):
    """Subclasses CrawlSpider to crawl the given site and parse each link to JSON"""

    # Spider name to be used when calling from the terminal
    name = "developers_prod"

    # Allow only the given host name(s)
    allowed_domains = ["example.com"]

    # Start crawling from this URL
    start_urls = ["https://example.com"]

    # Which statuses should be reported
    handle_httpstatus_list = [404]

    # Rules on how to extract links from the DOM, which URLs to deny, and a callback if needed
    rules = (Rule(LxmlLinkExtractor(deny=(['/android/'])), callback='parse_item', follow=True),)

    # Called for each requested page and used for parsing the response
    def parse_item(self, response):
        if response.status == 404:
            item = DevelopersLinkItem()
            item['url'] = response.url
            item['referer'] = response.request.headers.get('Referer')
            item['link_text'] = response.meta.get('link_text')
            item['status'] = response.status
            item['time'] = datetime.now().strftime("%Y-%m-%d %H:%M")
            return item
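(For reference, the JSON report itself comes from Scrapy's feed export, e.g. running the spider with scrapy crawl developers_prod -o broken_links.json; the output filename here is just an example.)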
I tried a few custom dupe filters, but in the end none of them worked.
Answer 0 (score: 0):
If I understand your question correctly, your requests are being filtered out by the CrawlSpider's duplicate filter by default. You can use the process_request parameter of the Rule class to set dont_filter=True on each request (https://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Rule).
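A minimal sketch of that idea, adapted to your rule (the helper name process_links_request is just illustrative; in Scrapy 1.x process_request receives only the request, while newer versions also pass the response, so adjust the signature to your version):

# Module-level helper: return a copy of each extracted request with
# dont_filter=True so the duplicate filter no longer drops repeat hits
# on the same (broken) URL. Newer Scrapy versions call this with
# (request, response) instead of just (request).
def process_links_request(request):
    return request.replace(dont_filter=True)


rules = (
    Rule(
        LxmlLinkExtractor(deny=(['/android/'])),
        callback='parse_item',
        follow=True,
        process_request=process_links_request,
    ),
)

Be aware that disabling the duplicate filter on every extracted request means pages can be revisited (and circular links can be re-crawled repeatedly), so the crawl may grow considerably; you may want to combine this with DEPTH_LIMIT or a narrower deny list.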