Platform: Debian 8 + Python 3.4 + Scrapy 1.3.2. This is my spider for downloading some URLs from yahoo.com:
import scrapy
import csv

class TestSpider(scrapy.Spider):
    name = "quote"
    allowed_domains = ["yahoo.com"]
    start_urls = ['url1', 'url2', 'url3', ..., 'url100']

    def parse(self, response):
        filename = response.url.split("=")[1]
        open('/tmp/' + filename + '.csv', 'wb').write(response.body)
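As an aside, splitting the URL on "=" only works while the URL has exactly one query parameter. A more robust sketch, using the standard library's URL parsing (the `filename_from_url` helper is illustrative, not part of the original spider):

```python
from urllib.parse import urlparse, parse_qs

def filename_from_url(url):
    # Pull the "s" query parameter (the ticker symbol) explicitly,
    # instead of splitting on "=", which breaks as soon as the URL
    # gains a second parameter.
    qs = parse_qs(urlparse(url).query)
    return qs["s"][0]

print(filename_from_url("https://chart.yahoo.com/table.csv?s=GLU"))  # GLU
```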
When it runs, some error messages appear:
2017-02-19 21:28:27 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response
<404 https://chart.yahoo.com/table.csv?s=GLU>: HTTP status code is not handled or not allowed
https://chart.yahoo.com/table.csv?s=GLU is one of the start_urls.
Now I want to capture those error messages.
import scrapy
import csv
import logging
from scrapy.utils.log import configure_logging

configure_logging(install_root_handler=False)
logging.basicConfig(
    filename='/tmp/log.txt',
    format='%(levelname)s: %(message)s',
    level=logging.INFO
)

class TestSpider(scrapy.Spider):
    name = "quote"
    allowed_domains = ["yahoo.com"]
    start_urls = ['url1', 'url2', 'url3', ..., 'url100']

    def parse(self, response):
        filename = response.url.split("=")[1]
        open('/tmp/' + filename + '.csv', 'wb').write(response.body)
Why can't error messages such as

2017-02-19 21:28:27 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response
<404 https://chart.yahoo.com/table.csv?s=GLU>: HTTP status code is not handled or not allowed

be recorded in /home/log.txt?
Following eLRuLL's suggestion, I added handle_httpstatus_list = [404]:
import scrapy
import csv
import logging
from scrapy.utils.log import configure_logging

configure_logging(install_root_handler=False)
logging.basicConfig(
    filename='/home/log.txt',
    format='%(levelname)s: %(message)s',
    level=logging.INFO
)

class TestSpider(scrapy.Spider):
    handle_httpstatus_list = [404]
    name = "quote"
    allowed_domains = ["yahoo.com"]
    start_urls = ['url1', 'url2', 'url3', ..., 'url100']

    def parse(self, response):
        filename = response.url.split("=")[1]
        open('/tmp/' + filename + '.csv', 'wb').write(response.body)
The error messages are still not logged to /home/log.txt. Why?
Answer 0 (score: 0)
Use the handle_httpstatus_list attribute on your spider to handle 404 statuses:
class TestSpider(scrapy.Spider):
    handle_httpstatus_list = [404]
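With handle_httpstatus_list = [404] set, the 404 responses are passed through to parse() instead of being dropped by HttpErrorMiddleware, so you can log them there yourself. A minimal sketch of that parse logic, written as a standalone function with a duck-typed stand-in for the response so it runs without Scrapy (handle_response and the SimpleNamespace stub are illustrative, not Scrapy API):

```python
import logging
from types import SimpleNamespace

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def handle_response(response):
    """Mirror the spider's parse(): log 404s, otherwise return the
    filename that would be written. Only .status and .url are used,
    so any object with those attributes works for testing."""
    if response.status == 404:
        logger.info("Ignoring response <404 %s>", response.url)
        return None
    return response.url.split("=")[1]

# Quick check with stub responses (no Scrapy needed):
bad = SimpleNamespace(status=404, url="https://chart.yahoo.com/table.csv?s=GLU")
handle_response(bad)   # emits the INFO log line
ok = SimpleNamespace(status=200, url="https://chart.yahoo.com/table.csv?s=GLU")
print(handle_response(ok))  # GLU
```

Inside a real spider, the same if/else would sit at the top of parse(self, response), with the logger call pointed at whatever handler your configure_logging/basicConfig setup installs.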