When I try to run "fetch https://www.sunnysports.com/robots.txt" with Scrapy, I always get a timeout error.
Error message:
DEBUG: Retrying <GET https://www.sunnysports.com/robots.txt> (failed 2 times): User timeout caused connection failure: Getting https://www.sunnysports.com/robots.txt took longer than 180.0 seconds..
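For reference, the fetch is issued like this (I'm assuming the scrapy fetch CLI command here, since the exact invocation isn't shown; fetch() inside scrapy shell is the other way to trigger it):

$ scrapy fetch https://www.sunnysports.com/robots.txt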
But I can get the content with curl -v or with urllib2. I tried aligning the request headers, e.g. making the Scrapy request headers identical to curl's, and making the curl request headers identical to Scrapy's. curl always works, but Scrapy always fails.
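A minimal sketch of the header-matching attempt, assuming curl's default headers (the User-Agent string below is an assumption, not the exact value used); ROBOTSTXT_OBEY is disabled so the request under test is the only one Scrapy sends:

import scrapy

class RobotsCheckSpider(scrapy.Spider):
    name = 'robots_check'
    # Disable the robots.txt middleware so this request is the only one sent
    custom_settings = {'ROBOTSTXT_OBEY': False}

    def start_requests(self):
        # Headers copied from a typical curl -v run (values are assumptions)
        yield scrapy.Request(
            'https://www.sunnysports.com/robots.txt',
            headers={
                'User-Agent': 'curl/7.64.1',
                'Accept': '*/*',
            },
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info('status=%s body_length=%d',
                         response.status, len(response.body))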
Python 2.7 test code:
import urllib2

# This plain urllib2 request returns the robots.txt content without issue
req = urllib2.Request('https://www.sunnysports.com/robots.txt')
response = urllib2.urlopen(req)
the_page = response.read()
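Since the scrapy version output below shows the environment is actually Python 3.6, the same check can also be run with the Python 3 stdlib for an apples-to-apples comparison (a sketch; the 30-second timeout is arbitrary):

import urllib.request

# Same request via Python 3's urllib, with an explicit timeout
req = urllib.request.Request('https://www.sunnysports.com/robots.txt')
response = urllib.request.urlopen(req, timeout=30)
the_page = response.read()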
My Scrapy version:
$ scrapy version -v
Scrapy : 2.1.0
lxml : 4.5.1.0
libxml2 : 2.9.10
cssselect : 1.1.0
parsel : 1.6.0
w3lib : 1.22.0
Twisted : 20.3.0
Python : 3.6.0 (v3.6.0:41df79263a11, Dec 22 2016, 17:23:13) - [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
pyOpenSSL : 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020)
cryptography : 2.9.2
Platform : Darwin-18.7.0-x86_64-i386-64bit