Question

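This is my first time scraping a website with Scrapy and Selenium. With the requests library I get no errors, but the snippet below, which uses urllib3 with BeautifulSoup, raises the error shown. My goal is to get the raw data rather than the HTML script truncated to its first 200 characters. Please refer to the code pasted below to see what I mean. Thanks.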

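I tried Python's requests library to extract data from the target website, and it works fine. Next I want to do the same kind of job with urllib3 and BeautifulSoup, to extract the raw data instead of the first 200 characters of the HTML script. I hope this makes sense; if not, please ask. Looking forward to your replies.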

import requests
import urllib3
from bs4 import BeautifulSoup

# Extracting web data using urllib3 and BeautifulSoup

# print(...) with a single argument behaves the same on Python 2 and Python 3
print("Retrieved the following data (Raw Form) using 'urllib3' lib \n")
http = urllib3.PoolManager()
# This request is where the MaxRetryError / SSLError below is raised
r = http.request('GET', 'https://authoraditiagarwal.com')
soup = BeautifulSoup(r.data, 'lxml')
print(soup.title)
print(soup.title.text)

Error:

File "C:\Python27\lib\site-packages\urllib3\util\retry.py", line 399, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
MaxRetryError: HTTPSConnectionPool(host='authoraditiagarwal.com', port=443): 
Max retries exceeded with url: / (Caused by SSLError(SSLError("bad handshake: 
Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')],)",),))
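
The "certificate verify failed" message suggests that urllib3 could not validate the server's certificate chain against a trusted CA bundle. One possible way around this, sketched below, is to point `PoolManager` at an explicit CA bundle. This is only a sketch under assumptions: it assumes the third-party `certifi` package is installed and that a missing or outdated local CA store is the actual cause, which the traceback alone does not confirm. The helper names `make_pool` and `fetch_raw` are hypothetical and not part of the original code.

```python
import certifi
import urllib3


def make_pool():
    """Build a PoolManager that verifies certificates against certifi's CA bundle.

    Assumption: the handshake fails because no usable CA bundle is found locally.
    """
    return urllib3.PoolManager(
        cert_reqs='CERT_REQUIRED',   # keep certificate verification enabled
        ca_certs=certifi.where(),    # path to the CA bundle shipped with certifi
    )


def fetch_raw(url, http=None):
    """Return the raw response body for url (hypothetical helper)."""
    http = http or make_pool()
    response = http.request('GET', url)
    return response.data
```

The bytes returned by `fetch_raw('https://authoraditiagarwal.com')` could then be passed to `BeautifulSoup(..., 'lxml')` exactly as in the snippet above. Verification stays switched on here; disabling it would hide the error but also remove the protection that HTTPS provides.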

How to fix a Python scraping error related to a PoolManager HTTPS connection to get raw data from a web page

0 answers: