So it seems https://packetstream.io/ has been failing everywhere my spiders use the proxy service. I contacted them and they said they haven't had any service interruptions. I keep getting this error message:
Retrying <GET https://www.oddschecker.com/us/boxing-mma> (failed 2 times): User timeout caused connection failure: Getting https://www.oddschecker.com/us/boxing-mma took longer than 180.0 seconds..
Middleware settings (settings.py):
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    'sfb.middlewares.SurefirebettingDownloaderMiddleware': 543,
    'sfb.middlewares.CustomProxyMiddleware': 350,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 400,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_useragents.downloadermiddlewares.useragents.UserAgentsMiddleware': 500,
}
Proxy middleware (middlewares.py):
from w3lib.http import basic_auth_header

class CustomProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta["proxy"] = "https://proxy.packetstream.io:port"
        request.headers["Proxy-Authorization"] = basic_auth_header("username",
                                                                   "API key")
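For what it's worth, `basic_auth_header` (from w3lib, a Scrapy dependency) just builds a standard HTTP Basic auth value, so you can sanity-check the header the middleware sends without running Scrapy at all. This is a minimal stdlib equivalent; "user" and "secret" are placeholder credentials, not real ones:

```python
# Stdlib equivalent of w3lib.http.basic_auth_header for ASCII credentials:
# the value is "Basic " followed by base64("username:password").
import base64

def basic_auth_value(username: str, password: str) -> bytes:
    creds = f"{username}:{password}".encode("latin-1")
    return b"Basic " + base64.b64encode(creds)

print(basic_auth_value("user", "secret"))  # b'Basic dXNlcjpzZWNyZXQ='
```

Printing the value your middleware produces and comparing it against what the provider's dashboard or docs expect rules out a malformed credential string as the cause.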
Spider:
import scrapy
from bs4 import BeautifulSoup

class OddscheckerSpider(scrapy.Spider):
    name = 'oddschecker'
    allowed_domains = []
    start_urls = ["https://www.oddschecker.com/us/boxing-mma"]

    def parse(self, response):
        soup = BeautifulSoup(response.text, "lxml")
It doesn't look like my proxy was simply banned by this one site, because all of my spiders now fail whenever they use the proxy service. If I comment out the proxy settings and middleware, everything works fine. Any ideas?
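One way to narrow this down is to exercise the same proxy outside Scrapy entirely. Below is a stdlib-only sketch; the hostname, port (8080 here), and credentials are placeholders standing in for your real PacketStream values. Note the `http://` scheme on the proxy URL itself: some providers only accept the client-side connection over plain HTTP even for HTTPS targets, so trying both schemes is a cheap experiment when debugging timeouts.

```python
# Minimal check of a proxy outside Scrapy, using only the standard library.
# Hostname, port, and credentials below are placeholders, not real values.
import urllib.request

def build_proxy_opener(user: str, api_key: str, host: str, port: int) -> urllib.request.OpenerDirector:
    # http:// scheme on the proxy URL itself; the target URL can still be HTTPS.
    proxy_url = f"http://{user}:{api_key}@{host}:{port}"
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)

opener = build_proxy_opener("username", "APIkey", "proxy.packetstream.io", 8080)

# To actually exercise the proxy (this hangs the same way the spider does
# if the proxy endpoint is unreachable):
#   with opener.open("https://www.oddschecker.com/us/boxing-mma", timeout=30) as resp:
#       print(resp.status)
```

If this plain request also times out, the problem is between you and the proxy endpoint (credentials, port, scheme, or network), not Scrapy; if it succeeds, the issue is in the Scrapy configuration, and the `https://` scheme in `request.meta["proxy"]` is the first thing I'd swap for `http://`.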