I am working on a data-scraping project. My code uses Scrapy (version 1.0.4) and Selenium (version 2.47.1).
import time

from scrapy import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from scrapy.spiders import CrawlSpider
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException


class TradesySpider(CrawlSpider):
    name = 'tradesy'
    start_urls = ['My Start url',]

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)
        while True:
            tradesy_urls = Selector(response).xpath('//div[@id="right-panel"]')
            data_urls = tradesy_urls.xpath('div[@class="item streamline"]/a/@href').extract()
            for link in data_urls:
                url = 'My base url' + link
                yield Request(url=url, callback=self.parse_data)
            time.sleep(10)
            try:
                data_path = self.driver.find_element_by_xpath('//*[@id="page-next"]')
            except NoSuchElementException:
                # No "next" button found, so this was the last page
                break
            data_path.click()
            time.sleep(10)

    def parse_data(self, response):
        'Scrapy Operations...'
When I run my code, I get the expected output for some URLs, but for others I receive the following error.
2016-01-19 15:45:17 [scrapy] DEBUG: Retrying <GET MY_URL> (failed 1 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL3_READ_BYTES', 'ssl handshake failure')]>]
Please suggest a solution for this problem.
Answer 0 (score: 10)
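According to this reported issue, you can create your own ContextFactory to handle SSL. Despite its name, SSLv23_METHOD does not force SSLv2 or SSLv3; it lets the client and server negotiate the highest protocol version they both support.
context.py: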
from OpenSSL import SSL
from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory


class CustomContextFactory(ScrapyClientContextFactory):
    """
    Custom context factory that allows SSL negotiation.
    """

    def __init__(self):
        # Use SSLv23_METHOD so we can use protocol negotiation
        self.method = SSL.SSLv23_METHOD

settings.py:
DOWNLOADER_CLIENTCONTEXTFACTORY = 'yourproject.context.CustomContextFactory'
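To check whether a handshake failure like the one in the question comes from a protocol-version mismatch, it can help to probe the server outside Scrapy. The following is a minimal sketch that uses only Python's standard `ssl` module, not Scrapy's or pyOpenSSL's API. The helper names `make_context` and `try_handshake` are hypothetical, and the sketch assumes Python 3.7 or newer.

```python
import socket
import ssl


def make_context(version=None):
    """Build a client context; pin it to one TLS version if given, else negotiate."""
    ctx = ssl.create_default_context()
    if version is not None:
        # Pinning min and max to the same value disables version negotiation.
        ctx.minimum_version = version
        ctx.maximum_version = version
    return ctx


def try_handshake(host, port=443, version=None, timeout=10):
    """Return the negotiated protocol name, or a 'failed: ...' message."""
    ctx = make_context(version)
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                return tls.version()
    except (ssl.SSLError, OSError) as exc:
        return "failed: %s" % exc
```

Calling `try_handshake(host)` lets the version be negotiated, while `try_handshake(host, version=ssl.TLSVersion.TLSv1_2)` pins a single version. If the negotiated call succeeds and a pinned call fails, a context factory with a fixed method is a plausible cause of the error.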
Answer 1 (score: 0)
A variation of eLRuLL's answer that does not require an extra file. It "decorates" the __init__ method of the ScrapyClientContextFactory class.
from OpenSSL import SSL
from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory
init = ScrapyClientContextFactory.__init__
def init2(self, *args, **kwargs):
    init(self, *args, **kwargs)
    self.method = SSL.SSLv23_METHOD
ScrapyClientContextFactory.__init__ = init2
Answer 2 (score: 0)
I ran into this error when using Scrapy 1.5.0:
Error downloading: https://my.website.com>: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'tls12_check_peer_sigalg', 'wrong curve')]>]
What finally worked was updating my Twisted version (from 17.9.0 to 19.10.0). I also updated Scrapy to 2.4.0, along with some other versions:
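To compare your own environment against these versions, a small sketch like the following prints what is installed. The default package list is an assumption based on the libraries mentioned in this thread, `installed_version` and `report` are hypothetical helper names, and `importlib.metadata` requires Python 3.8 or newer.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report(packages=("Scrapy", "Twisted", "pyOpenSSL", "cryptography")):
    """Map each package name to its installed version (None when not installed)."""
    return {name: installed_version(name) for name in packages}
```

Running `print(report())` before and after an upgrade shows whether Twisted and the TLS-related packages actually changed.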