Scrapy log messages in the log file

Time: 2014-06-26 10:02:25

Tags: python django web-scraping scrapy

These are my Scrapy settings:
LOG_ENABLED = True
STATS_ENABLED = True
LOG_FILE = 'crawl.log'

My spider is:

class AbcSpider(XMLFeedSpider):
    handle_httpstatus_list = [404, 500]
    name = 'abctv'
    allowed_domains = ['abctvnepal.com.np']
    start_urls = [
        'http://www.abctvnepal.com.np',
    ]

    def parse(self, response):

        mesg = "Spider {} is not working".format(name)

        if response.status in self.handle_httpstatus_list:
            return log.msg(mesg, level=log.ERROR)

        hxs = HtmlXPathSelector(response) # The XPath selector
        sites = hxs.select('//div[@class="marlr respo-left"]/div/div/h3')
        items = []
        for site in sites:
            item = NewsItem()
            item['title'] = escape(''.join(site.select('a/text()').extract())).strip()
            item['link'] = escape(''.join(site.select('a/@href').extract())).strip()
            item['description'] = escape(''.join(site.select('p/text()').extract()))
            item = Request(item['link'],meta={'item': item},callback=self.parse_detail)
            items.append(item)
        return items

    def parse_detail(self, response):
        item = response.meta['item']
        sel = HtmlXPathSelector(response)
        details = sel.select('//div[@class="entry"]/p/text()').extract()
        detail = ''
        for piece in details:
            detail = detail + piece
        item['details'] = detail
        item['location'] = detail.split(",",1)[0]
        item['published_date'] = (detail.split(" ",1)[1]).split(" ",1)[0]+' '+((detail.split(" ",1)[1]).split(" ",1)[1]).split(" ",1)[0]     
        return item

I want to write a log message whenever the response code is in handle_httpstatus_list = [404, 500]. Can anyone give an example of how to do this? It would be helpful.

1 answer:

Answer 0 (score: 1)

The Scrapy documentation is well written and contains plenty of example code. If you are working on your first Scrapy project, it is worth browsing through it. :)

For example, a quick scan of the logging documentation turns up the following example code:

from scrapy import log
log.msg("This is a warning", level=log.WARNING)

So adding the import and removing the return should fix the code.

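Also, the mesg line should use self.name, because name is a class attribute and is not visible as a bare variable inside parse: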

mesg = "Spider {} is not working".format(self.name)
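
The same check can be sketched with Python's standard logging module, which Scrapy 1.0 and later use in place of scrapy.log. This is only a minimal sketch, not the answer's exact method: the helper name log_if_error_status and the HANDLED_STATUSES constant are hypothetical names introduced for illustration, and the status codes simply mirror handle_httpstatus_list from the question.

```python
import logging

# Assumption: mirrors handle_httpstatus_list = [404, 500] from the question.
HANDLED_STATUSES = (404, 500)


def log_if_error_status(logger, spider_name, status, handled=HANDLED_STATUSES):
    """Log an error and return True when `status` is one of the handled codes.

    Hypothetical helper: `logger` is any standard logging.Logger (or
    LoggerAdapter), `spider_name` is the spider's name attribute.
    """
    if status in handled:
        # The logging call is a statement of its own; nothing is returned from it.
        logger.error("Spider %s is not working (HTTP %d)", spider_name, status)
        return True
    return False
```

Inside parse, assuming a Scrapy version whose spiders expose self.logger, the helper could be called as "if log_if_error_status(self.logger, self.name, response.status): return", so that an error response is logged and then skipped rather than parsed.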