I can successfully access http pages through a proxy in Scrapy, but I cannot access https sites. I have researched the topic, but it is still not clear to me. Is it possible to access https pages through a proxy with Scrapy? Do I need to patch anything or add some custom code? If someone can confirm this is standard functionality, I can follow up with more details. Hopefully it is something simple.
Edit:
Here is what I added to my settings file:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'test_website.middlewares.ProxyMiddleware': 100,
}
PROXIES = [{'ip_port': 'us-il.proxymesh.com:31280', 'user_pass': 'username:password'}]
Here is the code for my spider:
import scrapy

class TestSpider(scrapy.Spider):
    name = "test_spider"
    allowed_domains = ["ipify.org"]
    start_urls = ["https://api.ipify.org"]

    def parse(self, response):
        with open('test.html', 'wb') as f:
            f.write(response.body)
Here is the middleware file:
import base64
import random

from settings import PROXIES

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        proxy = random.choice(PROXIES)
        if proxy['user_pass'] is not None:
            request.meta['proxy'] = "http://%s" % proxy['ip_port']
            encoded_user_pass = base64.encodestring(proxy['user_pass'])
            request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
        else:
            request.meta['proxy'] = "http://%s" % proxy['ip_port']
Here is the log file:
2015-08-12 20:15:50 [scrapy] INFO: Scrapy 1.0.3 started (bot: test_website)
2015-08-12 20:15:50 [scrapy] INFO: Optional features available: ssl, http11
2015-08-12 20:15:50 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'test_website.spiders', 'SPIDER_MODULES': ['test_website.spiders'], 'LOG_STDOUT': True, 'LOG_FILE': 'log.txt', 'BOT_NAME': 'test_website'}
2015-08-12 20:15:51 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-08-12 20:15:53 [scrapy] INFO: Enabled downloader middlewares: ProxyMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-08-12 20:15:53 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-08-12 20:15:53 [scrapy] INFO: Enabled item pipelines:
2015-08-12 20:15:53 [scrapy] INFO: Spider opened
2015-08-12 20:15:53 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-08-12 20:15:53 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-08-12 20:15:53 [scrapy] DEBUG: Retrying <GET https://api.ipify.org> (failed 1 times): [<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]
2015-08-12 20:15:53 [scrapy] DEBUG: Retrying <GET https://api.ipify.org> (failed 2 times): [<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]
2015-08-12 20:15:53 [scrapy] DEBUG: Gave up retrying <GET https://api.ipify.org> (failed 3 times): [<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]
2015-08-12 20:15:53 [scrapy] ERROR: Error downloading <GET https://api.ipify.org>: [<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]
2015-08-12 20:15:53 [scrapy] INFO: Closing spider (finished)
2015-08-12 20:15:53 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 3,
'downloader/request_bytes': 819,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 8, 13, 2, 15, 53, 943000),
'log_count/DEBUG': 4,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2015, 8, 13, 2, 15, 53, 38000)}
2015-08-12 20:15:53 [scrapy] INFO: Spider closed (finished)
If I remove the "s" from "https", or if I disable the proxy, the spider runs successfully. I can also access https pages through the proxy in a browser.
Answer 0 (score: 1)
In my case it happened because I used base64.encodestring instead of base64.b64encode in the proxy middleware. Try changing it: on Python 2, encodestring appends a trailing newline to its output, which produces a malformed Proxy-Authorization header.
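For reference, a minimal corrected version of the middleware from the question, changing only the encoding call (b64encode returns the bare base64 value, with no embedded newlines):

import base64
import random

from settings import PROXIES

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        proxy = random.choice(PROXIES)
        request.meta['proxy'] = "http://%s" % proxy['ip_port']
        if proxy['user_pass'] is not None:
            # b64encode, unlike encodestring, adds no trailing newline
            encoded_user_pass = base64.b64encode(proxy['user_pass'])
            request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass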
Answer 1 (score: 0)
I believe it is possible. If you set the proxy through Request.meta, it should work fine. If you set the proxy with the http_proxy environment variable, you probably also need to set https_proxy, since each variable only covers its own scheme. That said, your proxy may simply not support HTTPS. It would be easier to help if you posted the error. A sketch of the Request.meta approach follows.
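As a minimal sketch of the Request.meta approach (the spider name is hypothetical and the proxy address is a placeholder; HttpProxyMiddleware honours a proxy set this way for both http and https URLs):

import scrapy

class MetaProxySpider(scrapy.Spider):
    name = "meta_proxy_spider"
    start_urls = ["https://api.ipify.org"]

    def start_requests(self):
        for url in self.start_urls:
            # Setting request.meta['proxy'] routes this request through the
            # proxy, including https requests (via CONNECT tunneling)
            yield scrapy.Request(url, meta={'proxy': 'http://<PROXY_SERVER>:<PROXY_PORT>'})

    def parse(self, response):
        self.log(response.body)

For the environment-variable route, export both http_proxy and https_proxy before running the crawl, so that requests of both schemes are proxied.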
Answer 2 (score: 0)
Scrapy handles HTTPS/SSL on its own; as @Aminah Nuraini said, the issue is just using base64.encodestring instead of base64.b64encode in the proxy middleware.
Add the following code in middlewares.py:
import base64

class ProxyMiddleware(object):
    # Overwrite process_request
    def process_request(self, request, spider):
        # Set the location of the proxy (the scheme prefix is required)
        request.meta['proxy'] = "http://<PROXY_SERVER>:<PROXY_PORT>"

        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "<PROXY_USERNAME>:<PROXY_PASSWORD>"
        # Set up basic authentication for the proxy
        encoded_user_pass = base64.b64encode(proxy_user_pass)
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
Then enable the middlewares in settings.py:
DOWNLOADER_MIDDLEWARES = {
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
'<YOUR_PROJECT>.middlewares.ProxyMiddleware': 100,
}
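A note on the priority numbers: ProxyMiddleware is registered at 100, so its process_request runs before the built-in HttpProxyMiddleware at 110 (lower orders run earlier on the way to the downloader), and HttpProxyMiddleware leaves request.meta['proxy'] untouched once it is already set. Also, the snippets above assume Python 2, matching the Scrapy 1.0.3 log in the question; on Python 3, base64.b64encode takes and returns bytes, so the authentication lines would need something like:

        # Python 3 sketch: encode to bytes before base64, decode after
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode('ascii')).decode('ascii')
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass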