When I try to run "fetch https://www.sunnysports.com/robots.txt" with Scrapy, I always get a timeout error.
Error message:
DEBUG: Retrying <GET https://www.sunnysports.com/robots.txt> (failed 2 times): User timeout caused connection failure: Getting https://www.sunnysports.com/robots.txt took longer than 180.0 seconds..
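For reference, the fetch is issued like this (I'm assuming the scrapy fetch CLI command here, since the exact invocation isn't shown; fetch() inside scrapy shell is the other way to trigger it):

$ scrapy fetch https://www.sunnysports.com/robots.txt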
But I can get the content with curl -v or with urllib2. I tried aligning the request headers, e.g. making the Scrapy request headers identical to curl's, and making the curl request headers identical to Scrapy's. curl always works, but Scrapy always fails.
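A minimal sketch of the header-matching attempt, assuming curl's default headers (the User-Agent string below is an assumption, not the exact value used); ROBOTSTXT_OBEY is disabled so the request under test is the only one Scrapy sends:

import scrapy

class RobotsCheckSpider(scrapy.Spider):
    name = 'robots_check'
    # Disable the robots.txt middleware so this request is the only one sent
    custom_settings = {'ROBOTSTXT_OBEY': False}

    def start_requests(self):
        # Headers copied from a typical curl -v run (values are assumptions)
        yield scrapy.Request(
            'https://www.sunnysports.com/robots.txt',
            headers={
                'User-Agent': 'curl/7.64.1',
                'Accept': '*/*',
            },
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info('status=%s body_length=%d',
                         response.status, len(response.body))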
Python 2.7 test code:
import urllib2

# This plain urllib2 request returns the robots.txt content without issue
req = urllib2.Request('https://www.sunnysports.com/robots.txt')
response = urllib2.urlopen(req)
the_page = response.read()
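Since the scrapy version output below shows the environment is actually Python 3.6, the same check can also be run with the Python 3 stdlib for an apples-to-apples comparison (a sketch; the 30-second timeout is arbitrary):

import urllib.request

# Same request via Python 3's urllib, with an explicit timeout
req = urllib.request.Request('https://www.sunnysports.com/robots.txt')
response = urllib.request.urlopen(req, timeout=30)
the_page = response.read()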
My Scrapy version:
$ scrapy version -v
Scrapy : 2.1.0
lxml : 4.5.1.0
libxml2 : 2.9.10
cssselect : 1.1.0
parsel : 1.6.0
w3lib : 1.22.0
Twisted : 20.3.0
Python : 3.6.0 (v3.6.0:41df79263a11, Dec 22 2016, 17:23:13) - [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
pyOpenSSL : 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020)
cryptography : 2.9.2
Platform : Darwin-18.7.0-x86_64-i386-64bit