How do I track URLs that fail to fetch in Scrapy?

Asked: 2014-03-07 07:22:46

Tags: python web-crawler scrapy

I am learning Scrapy by crawling websites. I want to keep track of the links/URLs that the spider fails to fetch, and I also want to know how to trigger a final task once the spider has finished all of its work.

The example code below shows what I am looking for. Of course it is not a real-life case, but I am learning, so I would like to get this working.

In other words: can I create something like function_to_be_triggered_when_url_is_not_able_to_fetch, which keeps track of the URLs the spider could not fetch? The other thing is how to create something like function_to_be_triggered_when_spider_has_done_its_all_pending_jobs(), which could be used to write intermediate data to a file or database, or to send a mail once all the domains have been crawled.

Here is the simple spider:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class MySpider(CrawlSpider):
    name = 'spider1'
    allowed_domains = [line.strip() for line in open('url_list.txt')]
    start_urls = ['http://' + line.strip() for line in open('url_list.txt')]
    rules = [Rule(SgmlLinkExtractor(), callback='parse_item', follow=True)]

    def __init__(self, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.count_fetched_urls = 0     # counts urls fetched successfully from the domains/sites
        self.count_failed_to_fetch = 0  # counts urls that could not be fetched (timeouts, 4XX HTTP errors)

    def parse_item(self, response):
        self.count_fetched_urls += 1
        # some more useful lines to process fetched urls

    def function_to_be_triggered_when_url_is_not_able_to_fetch(self):
        self.count_failed_to_fetch += 1
        print self.count_failed_to_fetch, 'urls failed to fetch till now'

    def function_to_be_triggered_when_spider_has_done_its_all_pending_jobs(self):
        print 'Total domains/sites:', len(self.start_urls)
        print 'Total links/urls the spider faced:', self.count_fetched_urls + self.count_failed_to_fetch
        print 'Successfully fetched urls/links:', self.count_fetched_urls
        print 'Failed to fetch urls/links:', self.count_failed_to_fetch
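
From what I could find in the Scrapy documentation, attaching an errback to each Request and listening for the spider_closed signal might be the way to do both things. Below is a minimal sketch of what I have in mind; the spider name, the methods on_fetch_error/report_results and the counters are my own inventions, the import paths may differ between Scrapy versions, and since I do not know how to attach an errback to the requests generated by CrawlSpider rules, I used a plain Spider with start_requests. Is this roughly the right approach?

from scrapy import signals
from scrapy.http import Request
from scrapy.spider import Spider
from scrapy.xlib.pydispatch import dispatcher


class TrackingSpider(Spider):
    name = 'tracking_spider'
    start_urls = ['http://example.com/']  # placeholder list

    def __init__(self, *args, **kwargs):
        super(TrackingSpider, self).__init__(*args, **kwargs)
        self.count_fetched_urls = 0
        self.count_failed_to_fetch = 0
        # run report_results() once the spider has finished all pending work
        dispatcher.connect(self.report_results, signals.spider_closed)

    def start_requests(self):
        # attach an errback so that failed downloads reach on_fetch_error()
        for url in self.start_urls:
            yield Request(url, callback=self.parse_item,
                          errback=self.on_fetch_error)

    def parse_item(self, response):
        self.count_fetched_urls += 1
        # ...process the fetched page here...

    def on_fetch_error(self, failure):
        # failure is a twisted Failure describing why the download failed
        # (timeout, DNS error, connection refused, ...)
        self.count_failed_to_fetch += 1
        self.log('%d urls failed to fetch so far' % self.count_failed_to_fetch)

    def report_results(self, spider):
        # called via the spider_closed signal, i.e. after all pending
        # requests are done; write to file/db or send mail here
        self.log('Successfully fetched: %d' % self.count_fetched_urls)
        self.log('Failed to fetch: %d' % self.count_failed_to_fetch)

I am also not sure whether 4XX responses actually reach the errback, or whether the HttpError middleware filters them out before that.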

0 Answers:

There are no answers yet.