I have Scrapy crawling my site, finding links that return a 404 response, and writing them out to a JSON file. That part works very well.
However, I can't figure out how to get every instance of a broken link, because the duplicate filter is catching those links instead of retrying them.
Since our site has thousands of pages, with sections managed by several different teams, I need to be able to produce a broken-link report per section, rather than finding one instance and doing a site-wide search and replace.
Any help is greatly appreciated.
My current spider:
import scrapy
from datetime import datetime

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field


# Add Items for exporting to JSON
class DevelopersLinkItem(Item):
    url = Field()
    referer = Field()
    link_text = Field()
    status = Field()
    time = Field()


class DevelopersSpider(CrawlSpider):
    """Subclasses CrawlSpider to crawl the given site and parse each link to JSON"""

    # Spider name to be used when calling from the terminal
    name = "developers_prod"

    # Allow only the given host name(s)
    allowed_domains = ["example.com"]

    # Start crawling from this URL
    start_urls = ["https://example.com"]

    # Which statuses should be reported
    handle_httpstatus_list = [404]

    # Rules on how to extract links from the DOM, which URLs to deny, and a callback if needed
    rules = (Rule(LxmlLinkExtractor(deny=(['/android/'])), callback='parse_item', follow=True),)

    # Called for each requested page and used for parsing the response
    def parse_item(self, response):
        if response.status == 404:
            item = DevelopersLinkItem()
            item['url'] = response.url
            item['referer'] = response.request.headers.get('Referer')
            item['link_text'] = response.meta.get('link_text')
            item['status'] = response.status
            item['time'] = datetime.now().strftime("%Y-%m-%d %H:%M")
            return item
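(For reference, the JSON report itself comes from Scrapy's feed export, e.g. running the spider with scrapy crawl developers_prod -o broken_links.json; the output filename here is just an example.)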
I tried a few custom dupe filters, but in the end none of them worked.
Answer 0 (score: 0):
If I understand your question correctly, your requests are being filtered out by the CrawlSpider's duplicate filter by default. You can use the process_request parameter of the Rule class to set dont_filter=True on each request (https://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Rule).
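A minimal sketch of that idea, adapted to your rule (the helper name process_links_request is just illustrative; in Scrapy 1.x process_request receives only the request, while newer versions also pass the response, so adjust the signature to your version):

# Module-level helper: return a copy of each extracted request with
# dont_filter=True so the duplicate filter no longer drops repeat hits
# on the same (broken) URL. Newer Scrapy versions call this with
# (request, response) instead of just (request).
def process_links_request(request):
    return request.replace(dont_filter=True)


rules = (
    Rule(
        LxmlLinkExtractor(deny=(['/android/'])),
        callback='parse_item',
        follow=True,
        process_request=process_links_request,
    ),
)

Be aware that disabling the duplicate filter on every extracted request means pages can be revisited (and circular links can be re-crawled repeatedly), so the crawl may grow considerably; you may want to combine this with DEPTH_LIMIT or a narrower deny list.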