How do I add try/except handling in a Scrapy spider?

Asked: 2014-10-28 19:32:11

Tags: python scrapy

I built a simple crawler application with urllib2 and BeautifulSoup, and now I plan to convert it into a Scrapy spider. How do I handle errors while the crawler is running? My current application has code like this:

error_file = open('errors.txt', 'a')
finish_file = open('finishlink.txt', 'a')

# inside the loop that processes each link:
try:
    # code that processes the current link
    # if the link finishes successfully, record it in 'finishlink.txt'
    finish_file.write(link + '\n')
except Exception as e:
    # otherwise record the link and the error in 'errors.txt'
    error_file.write('%s %s\n' % (link, e))

So when I process thousands of links, the successfully processed ones are stored in finishlink.txt and the failures end up in errors.txt, so that later I can re-run just the failed links until they succeed. How can I accomplish the same thing in this code?

import scrapy


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename + '.txt', 'wb') as f:
            f.write(response.body)

1 Answer:

Answer 0 (score: 2)

You can create a spider middleware and override the process_spider_exception() method, saving the failed links to a file there.

Spider middleware is just a way of extending Scrapy's behaviour. Here is a complete example that you can modify as needed:

from scrapy import signals


class SaveErrorsMiddleware(object):
    def __init__(self, crawler):
        # open and close the log file together with the spider
        crawler.signals.connect(self.close_spider, signals.spider_closed)
        crawler.signals.connect(self.open_spider, signals.spider_opened)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def open_spider(self, spider):
        self.output_file = open('somefile.txt', 'a')

    def close_spider(self, spider):
        self.output_file.close()

    def process_spider_exception(self, response, exception, spider):
        # called whenever a spider callback raises an exception
        self.output_file.write(response.url + '\n')

Put this in a module and enable it in settings.py:

SPIDER_MIDDLEWARES = {
    'myproject.middleware.SaveErrorsMiddleware': 1000,
}
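
The value 1000 only determines where this middleware sits relative to the other spider middlewares listed in the SPIDER_MIDDLEWARES_BASE setting; for a standalone logger like this, the exact number is not critical.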

This code runs alongside your spider, firing the open_spider(), close_spider() and process_spider_exception() methods at the appropriate points.
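
To also cover the finishlink.txt half of the question, the same idea could be extended with process_spider_output(), which receives the results of every callback that completed without raising. Below is a minimal sketch, not the answer's own code: the SaveLinksMiddleware name is hypothetical, the file names are carried over from the original script, and the tab-separated error format is just an illustration.

from scrapy import signals


class SaveLinksMiddleware(object):
    """Hypothetical extension: record finished links as well as failed ones."""

    def __init__(self, crawler):
        # open and close both files together with the spider
        crawler.signals.connect(self.open_spider, signals.spider_opened)
        crawler.signals.connect(self.close_spider, signals.spider_closed)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def open_spider(self, spider):
        self.finish_file = open('finishlink.txt', 'a')
        self.error_file = open('errors.txt', 'a')

    def close_spider(self, spider):
        self.finish_file.close()
        self.error_file.close()

    def process_spider_output(self, response, result, spider):
        # the callback returned normally, so count this link as finished
        self.finish_file.write(response.url + '\n')
        return result

    def process_spider_exception(self, response, exception, spider):
        # the callback raised, so record the link and the error for a later re-run
        self.error_file.write('%s\t%s\n' % (response.url, exception))

It would be registered in SPIDER_MIDDLEWARES exactly like the example above.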

Read more: see the Scrapy documentation on spider middleware and signals.