I'm trying to crawl a website and check that none of its links redirect to pages that are down. Since no sitemap is available, I'm using Scrapy to crawl the site and collect every link on each page, but I can't manage to output a file that lists all the links found together with their status codes. The site I'm using to test the code is quotes.toscrape.com, and my code is:
from scrapy.spiders import Spider
from scrapy.http import Request
from mytest.items import MytestItem
import re

class MySpider(Spider):
    name = "sample"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com"]

    def parse(self, response):
        links = response.xpath('//a/@href').extract()
        # We store already crawled links in this list
        crawledLinks = []
        for link in links:
            # If it is a proper link and is not checked yet, yield it to the Spider
            if link not in crawledLinks:
                link = "http://quotes.toscrape.com" + link
                crawledLinks.append(link)
                yield Request(link, self.parse)
I tried adding the following lines after the yield:
item = MytestItem()
item['url'] = link
item['status'] = response.status
yield item
but it gives me a bunch of duplicates and no URLs with a 404 or 301 status. Does anyone know how I can get the URLs for every status?
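For what it's worth, yielding one item per fetched page (rather than one per extracted link) avoids the duplicates, although by itself it still only sees pages Scrapy treats as successful; the answers below cover how to let the other statuses through. A minimal sketch, assuming the MytestItem fields url and status from above (the rest of the structure is illustrative, not the original spider):

from scrapy.spiders import Spider
from scrapy.http import Request
from mytest.items import MytestItem


class MySpider(Spider):
    name = "sample"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com"]
    crawledLinks = set()  # shared set so each link is requested only once

    def parse(self, response):
        # one item per fetched page, so each URL/status pair is reported once
        item = MytestItem()
        item['url'] = response.url
        item['status'] = response.status
        yield item

        for href in response.xpath('//a/@href').extract():
            link = response.urljoin(href)
            if link not in self.crawledLinks:
                self.crawledLinks.add(link)
                yield Request(link, self.parse)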
Answer 0 (score: 2)
By default, Scrapy does not return any unsuccessful requests, but if you set an errback on the request you can catch them in one of your own functions and handle them there.
def parse(self, response):
    # some code
    yield Request(link, self.parse, errback=self.parse_error)

def parse_error(self, failure):
    # log the response as an error
The failure parameter will contain more information on the exact reason for the failure, since it could be an HTTP error (where you can fetch the response) but also, for example, a DNS lookup error (where there is no response).
The documentation contains an example of how to use the failure to determine the error reason and to access the Response if it is available:
# imports needed for this example:
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

def errback_httpbin(self, failure):
    # log all failures
    self.logger.error(repr(failure))

    # in case you want to do something special for some errors,
    # you may need the failure's type:

    if failure.check(HttpError):
        # these exceptions come from HttpError spider middleware
        # you can get the non-200 response
        response = failure.value.response
        self.logger.error('HttpError on %s', response.url)

    elif failure.check(DNSLookupError):
        # this is the original request
        request = failure.request
        self.logger.error('DNSLookupError on %s', request.url)

    elif failure.check(TimeoutError, TCPTimedOutError):
        request = failure.request
        self.logger.error('TimeoutError on %s', request.url)
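Tying this back to the question, here is a hedged sketch that yields one MytestItem per successfully fetched page from parse and one per failure from the errback (the field names url and status come from the question; the spider itself is illustrative, not the original code). In recent Scrapy versions, output yielded from an errback is processed like callback output, so both paths end up in the feed export; if that does not hold in your version, the errback can simply log instead. Failures without an HTTP response (DNS errors, timeouts) are recorded with a status of None:

from scrapy.spiders import Spider
from scrapy.http import Request
from scrapy.spidermiddlewares.httperror import HttpError
from mytest.items import MytestItem


class LinkCheckSpider(Spider):
    name = "linkcheck"  # hypothetical name
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com"]
    seen = set()

    def make_item(self, url, status):
        item = MytestItem()
        item['url'] = url
        item['status'] = status
        return item

    def parse(self, response):
        # successful (2xx) pages arrive here
        yield self.make_item(response.url, response.status)
        for href in response.xpath('//a/@href').extract():
            url = response.urljoin(href)
            if url not in self.seen:
                self.seen.add(url)
                yield Request(url, self.parse, errback=self.on_error)

    def on_error(self, failure):
        if failure.check(HttpError):
            # a non-2xx response was received: record its real status code
            response = failure.value.response
            yield self.make_item(response.url, response.status)
        else:
            # DNS errors, timeouts, etc. never produced a response
            yield self.make_item(failure.request.url, None)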
Answer 1 (score: 2)
You should either use HTTPERROR_ALLOW_ALL in your settings or set the meta key handle_httpstatus_all = True on all of your requests; see the documentation for details.
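As an illustration of that approach, a minimal sketch using the per-spider custom_settings attribute (the item fields follow the question; the spider name and the rest of the structure are assumptions). Note that 3xx responses are normally consumed by the redirect middleware, so to record 301s you would additionally disable redirects (e.g. REDIRECT_ENABLED = False) or handle them explicitly:

from scrapy.spiders import Spider
from scrapy.http import Request
from mytest.items import MytestItem


class StatusSpider(Spider):
    name = "status"  # hypothetical name
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com"]

    # per-spider equivalent of putting HTTPERROR_ALLOW_ALL = True in settings.py;
    # alternatively pass meta={'handle_httpstatus_all': True} on each Request
    custom_settings = {"HTTPERROR_ALLOW_ALL": True}

    seen = set()

    def parse(self, response):
        item = MytestItem()
        item['url'] = response.url
        item['status'] = response.status  # 404s etc. now reach parse()
        yield item

        # only follow links from pages that actually loaded
        if response.status == 200:
            for href in response.xpath('//a/@href').extract():
                url = response.urljoin(href)
                if url not in self.seen:
                    self.seen.add(url)
                    yield Request(url, self.parse)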