I am learning Scrapy by crawling websites. I want to keep track of the links/URLs the spider fails to fetch, and also to trigger a final task once the spider has finished all of its work.
The example code shows what I am looking for. It is of course not a real-life case, but I am learning, so I would like to get this working.
In other words, I want to create a function like function_to_be_triggered_when_url_is_not_able_to_fetch that keeps track of the URLs the spider could not fetch. The other thing is how to create a function like function_to_be_triggered_when_spider_has_done_its_all_pending_jobs(), which could be used to write intermediate data to a file or database, or to send an email, once every domain has been crawled.
Here is the simple spider:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class MySpider(CrawlSpider):
    name = 'spider1'
    allowed_domains = [i.split('\n')[0] for i in open('url_list.txt', 'r').readlines()]
    start_urls = ['http://' + i.split('\n')[0] for i in open('url_list.txt', 'r').readlines()]
    rules = [Rule(SgmlLinkExtractor(), callback='parse_item', follow=True)]

    def __init__(self, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.count_fetched_urls = 0      # count urls fetched successfully from the domains/sites
        self.count_failed_to_fetch = 0   # count urls that could not be fetched because of timeouts or 4XX HTTP errors

    def parse_item(self, response):
        self.count_fetched_urls = self.count_fetched_urls + 1
        # some more useful lines to process fetched urls

    def function_to_be_triggered_when_url_is_not_able_to_fetch(self):
        self.count_failed_to_fetch = self.count_failed_to_fetch + 1
        print self.count_failed_to_fetch, 'urls failed to fetch so far'

    def function_to_be_triggered_when_spider_has_done_its_all_pending_jobs(self):
        print 'Total domains/sites:', len(self.start_urls)
        print 'Total links/urls the spider saw:', self.count_fetched_urls + self.count_failed_to_fetch
        print 'Successfully fetched urls/links:', self.count_fetched_urls
        print 'Failed to fetch urls/links:', self.count_failed_to_fetch
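From the documentation it looks like the errback argument (available on Rule in recent Scrapy versions) and the closed(reason) shortcut for the spider_closed signal might be the hooks I need. Below is a rough sketch of what I imagine, using the newer LinkExtractor API instead of SgmlLinkExtractor; the method name handle_failed_url and the exact wiring are my own guesses, not something I have confirmed works:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'spider1'
    domains = [line.strip() for line in open('url_list.txt') if line.strip()]
    allowed_domains = domains
    start_urls = ['http://' + d for d in domains]

    # errback on a Rule only exists in recent Scrapy versions; in older
    # versions the requests would have to be built with an errback by hand
    rules = [Rule(LinkExtractor(), callback='parse_item',
                  errback='handle_failed_url', follow=True)]

    count_fetched_urls = 0
    count_failed_to_fetch = 0

    def parse_item(self, response):
        # called for every page that was fetched successfully
        self.count_fetched_urls += 1

    def handle_failed_url(self, failure):
        # called for requests that ended in a timeout, DNS error or HTTP error
        self.count_failed_to_fetch += 1
        self.logger.info('%d urls failed to fetch so far: %s',
                         self.count_failed_to_fetch, repr(failure))

    def closed(self, reason):
        # shortcut for the spider_closed signal: runs once all pending
        # requests are done, so file/DB writes or mail could go here
        self.logger.info('Total domains/sites: %d', len(self.start_urls))
        self.logger.info('Total links/urls the spider saw: %d',
                         self.count_fetched_urls + self.count_failed_to_fetch)
        self.logger.info('Successfully fetched urls/links: %d', self.count_fetched_urls)
        self.logger.info('Failed to fetch urls/links: %d', self.count_failed_to_fetch)

Is something along these lines the right direction, or is connecting to signals such as spider_closed from from_crawler the more idiomatic way to do it?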