Question

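I'm using this one as the basis for my script to get the status of all pages. But how can I check whether the files referenced on those pages (images, downloadables [pdf, doc], CSS, etc.) exist? I'm new to Python, and this is my code so far: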

class BrokenItem(Item):
    url = Field()
    referer = Field()
    status = Field()
    type = Field()

class BrokenLinksSpider(CrawlSpider):
    name = config.name
    allowed_domains = config.allowed_domains
    start_urls = config.start_urls
    handle_httpstatus_list = [404]
    rules = (Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),)

    def parse_item(self, response):
        item = BrokenItem()
        item['url'] = response.url
        item['referer'] = response.request.headers.get('Referer')
        item['status'] = response.status
        item['type'] = 'page'

        return item

Answer 1

You need to make HEAD requests. These do not download the content itself, only the headers that describe it.

For example, you can check:

# Scrapy returns header values as bytes, so convert before comparing.
if int(response.headers.get('Content-Length') or 0) > 0:
    pass  # the body is not empty
if b'image' in (response.headers.get('Content-Type') or b''):
    pass  # the content is an image
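
The same header checks can be pulled into a small helper. This is only a sketch: describe_headers is a hypothetical name, and it assumes the headers object maps names to bytes values and returns None for a missing header, as Scrapy's response.headers.get(...) does. A plain dict stands in for the real headers here, so the lookup is case-sensitive in this example.

```python
def describe_headers(headers):
    """Summarise a response from its headers alone (hypothetical helper).

    `headers` is assumed to map header names to bytes values and to
    return None from .get() when a header is missing.
    """
    raw_length = headers.get('Content-Length')
    content_type = (headers.get('Content-Type') or b'').decode('latin-1')
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    mime = content_type.split(';')[0].strip().lower()
    return {
        'has_body': raw_length is not None and int(raw_length) > 0,
        'is_image': mime.startswith('image/'),
        'mime': mime,
    }


info = describe_headers({'Content-Length': b'2048',
                         'Content-Type': b'image/png'})
print(info)
```

Comparing on the media type rather than on the whole header avoids false matches when the type carries extra parameters.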

To issue a HEAD request in Scrapy, pass the method='HEAD' keyword argument when creating the Request, i.e. Request('http://scrapy.org', method='HEAD')

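Now you need to figure out how to check all the fields you want. Depending on the speed and complexity you need, there are many ways to do this. The simplest is probably either to chain requests in the spider itself or to check each field in Scrapy's item pipelines.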

How to get all pages and files of a website with their status using Scrapy?
