Platform: Debian 8 + Python 3.4 + Scrapy 1.3.2. This is my spider for downloading some URLs from yahoo.com:
import scrapy
import csv

class TestSpider(scrapy.Spider):
    name = "quote"
    allowed_domains = ["yahoo.com"]
    start_urls = ['url1', 'url2', 'url3', ..., 'url100']

    def parse(self, response):
        filename = response.url.split("=")[1]
        open('/tmp/' + filename + '.csv', 'wb').write(response.body)
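As an aside, splitting the URL on "=" only works while the URL has exactly one query parameter. A more robust sketch, using the standard library's URL parsing (the `filename_from_url` helper is illustrative, not part of the original spider):

```python
from urllib.parse import urlparse, parse_qs

def filename_from_url(url):
    # Pull the "s" query parameter (the ticker symbol) explicitly,
    # instead of splitting on "=", which breaks as soon as the URL
    # gains a second parameter.
    qs = parse_qs(urlparse(url).query)
    return qs["s"][0]

print(filename_from_url("https://chart.yahoo.com/table.csv?s=GLU"))  # GLU
```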
When it runs, some error messages appear:
2017-02-19 21:28:27 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response
<404 https://chart.yahoo.com/table.csv?s=GLU>: HTTP status code is not handled or not allowed
https://chart.yahoo.com/table.csv?s=GLU is one of the start_urls.
Now I want to capture those error messages.
import scrapy
import csv
import logging
from scrapy.utils.log import configure_logging

configure_logging(install_root_handler=False)
logging.basicConfig(
    filename='/tmp/log.txt',
    format='%(levelname)s: %(message)s',
    level=logging.INFO
)

class TestSpider(scrapy.Spider):
    name = "quote"
    allowed_domains = ["yahoo.com"]
    start_urls = ['url1', 'url2', 'url3', ..., 'url100']

    def parse(self, response):
        filename = response.url.split("=")[1]
        open('/tmp/' + filename + '.csv', 'wb').write(response.body)
Why can't error messages such as

2017-02-19 21:28:27 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response
<404 https://chart.yahoo.com/table.csv?s=GLU>: HTTP status code is not handled or not allowed

be recorded in /home/log.txt?
Following eLRuLL's suggestion, I added handle_httpstatus_list = [404]:
import scrapy
import csv
import logging
from scrapy.utils.log import configure_logging

configure_logging(install_root_handler=False)
logging.basicConfig(
    filename='/home/log.txt',
    format='%(levelname)s: %(message)s',
    level=logging.INFO
)

class TestSpider(scrapy.Spider):
    handle_httpstatus_list = [404]
    name = "quote"
    allowed_domains = ["yahoo.com"]
    start_urls = ['url1', 'url2', 'url3', ..., 'url100']

    def parse(self, response):
        filename = response.url.split("=")[1]
        open('/tmp/' + filename + '.csv', 'wb').write(response.body)
The error messages are still not logged to /home/log.txt. Why?
Answer 0 (score: 0)
Use the handle_httpstatus_list attribute on your spider to handle 404 statuses:
class TestSpider(scrapy.Spider):
    handle_httpstatus_list = [404]
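With handle_httpstatus_list = [404] set, the 404 responses are passed through to parse() instead of being dropped by HttpErrorMiddleware, so you can log them there yourself. A minimal sketch of that parse logic, written as a standalone function with a duck-typed stand-in for the response so it runs without Scrapy (handle_response and the SimpleNamespace stub are illustrative, not Scrapy API):

```python
import logging
from types import SimpleNamespace

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def handle_response(response):
    """Mirror the spider's parse(): log 404s, otherwise return the
    filename that would be written. Only .status and .url are used,
    so any object with those attributes works for testing."""
    if response.status == 404:
        logger.info("Ignoring response <404 %s>", response.url)
        return None
    return response.url.split("=")[1]

# Quick check with stub responses (no Scrapy needed):
bad = SimpleNamespace(status=404, url="https://chart.yahoo.com/table.csv?s=GLU")
handle_response(bad)   # emits the INFO log line
ok = SimpleNamespace(status=200, url="https://chart.yahoo.com/table.csv?s=GLU")
print(handle_response(ok))  # GLU
```

Inside a real spider, the same if/else would sit at the top of parse(self, response), with the logger call pointed at whatever handler your configure_logging/basicConfig setup installs.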