I am trying to use Scrapy to scrape all music file types from the copyright.gov website, but I keep getting this error:
User timeout caused connection failure: Getting http://cocatalog.loc.gov/cgi-bin/Pwebrecon.cgi?PID=JADxIm18gK9YX6t-BSYC9oABskwhR&SEQ=20150331032850&CNT=25&HIST=1&Search_Arg=PAu003%3F&Search_Code=FT%2A took longer than 180 seconds..
I know the site imposes some limits (even a manual search can make the site time out). Here is my spider:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from datetime import datetime
from scrapy.http import FormRequest, Request
from scrapy.utils.response import open_in_browser

class CopyrightSpider(BaseSpider):
    name = "copyright_records"
    start_urls = ["http://cocatalog.loc.gov/cgi-bin/Pwebrecon.cgi?DB=local&PAGE=First"]

    def parse(self, response):
        yield FormRequest.from_response(response,
            formname='querybox',
            formdata={'Search_Arg': 'music?', 'Search_Code': 'FT*'},
            cookies={'s_sess':'%20s_cc%3Dtrue%3B%20s_sq%3D%3B', 's_vi':'[CS]v1|2A8CD884851D46DB-400019054027B53D[CE]'},
            callback=self.parse1)

    def parse1(self, response):
        open_in_browser(response)
Is there a way to work around this timeout?
Answer 0 (score: 0)
Set DOWNLOAD_TIMEOUT in settings.py. It is given in seconds, and the default is 180:
DOWNLOAD_TIMEOUT = 360  # six minutes
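Since the site is slow even for manual searches, a longer timeout alone may not be enough. Below is a minimal sketch of a settings.py fragment that pairs the timeout with retry and throttling settings. The setting names are standard Scrapy settings, but the specific values are assumptions chosen for illustration and not something tested against cocatalog.loc.gov:

```python
# settings.py (fragment) -- illustrative values, tune them for the target site

# Wait up to six minutes for a response before giving up (default: 180).
DOWNLOAD_TIMEOUT = 360

# Retry failed requests, which includes timeouts, a few more times
# than the default of 2.
RETRY_ENABLED = True
RETRY_TIMES = 5

# Be gentle with a slow server: pause between requests and avoid
# hitting the same domain in parallel.
DOWNLOAD_DELAY = 5  # seconds between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Let Scrapy adapt the delay to the observed response latency.
AUTOTHROTTLE_ENABLED = True
```

If only the search request is slow, the timeout can also be raised for that one request by putting a `download_timeout` key in its `meta` dict, rather than changing it project-wide.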