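I am trying to follow the tutorial here to scrape data from Remax.com. For now, I am only interested in getting the square footage of a particular house. However, I get this error: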
Error during requests to https://www.remax.com/realestatehomesforsale/25-montage-way-laguna-beach-ca-92651-gid100012499996.html : HTTPSConnectionPool(host='www.remax.com', port=443): Max retries exceeded with url: /realestatehomesforsale/25-montage-way-laguna-beach-ca-92651-gid100012499996.html (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'ssl3_get_server_certificate', 'certificate verify failed')],)",),))
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-7-28b8e2248942> in <module>()
1 raw_html = simple_get('https://www.remax.com/realestatehomesforsale/25-montage-way-laguna-beach-ca-92651-gid100012499996.html')
----> 2 html = BeautifulSoup(raw_html, 'html.parser')
3 for i, li in enumerate(html.select('li')):
4 print(i, li.text)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\bs4\__init__.py in __init__(self, markup, features, builder, parse_only, from_encoding, exclude_encodings, **kwargs)
190 if hasattr(markup, 'read'): # It's a file-type object.
191 markup = markup.read()
--> 192 elif len(markup) <= 256 and (
193 (isinstance(markup, bytes) and not b'<' in markup)
194 or (isinstance(markup, str) and not '<' in markup)
TypeError: object of type 'NoneType' has no len()
Here is all of my code so far:
from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup

def simple_get(url):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the content-type of response is some kind of HTML/XML, return the
    text content, otherwise return None.
    """
    try:
        with closing(get(url, stream=True)) as resp:
            if is_good_response(resp):
                return resp.content
            else:
                return None
    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None

def is_good_response(resp):
    """
    Returns True if the response seems to be HTML, False otherwise.
    """
    content_type = resp.headers['Content-Type'].lower()
    return (resp.status_code == 200
            and content_type is not None
            and content_type.find('html') > -1)

raw_html = simple_get('https://www.remax.com/realestatehomesforsale/25-montage-way-laguna-beach-ca-92651-gid100012499996.html')
html = BeautifulSoup(raw_html, 'html.parser')
for i, li in enumerate(html.select('li')):
    print(i, li.text)
I am still new to web scraping, so I am not sure how to fix this. Any suggestions would be greatly appreciated.
Answer 0 (score: 1)
I am not sure what is causing your problem, but if you only want the square footage of the house on that page, you can use
import urllib.request  # a bare "import urllib" does not reliably load the request submodule
from bs4 import BeautifulSoup

url = 'https://www.remax.com/realestatehomesforsale/25-montage-way-laguna-beach-ca-92651-gid100012499996.html'
# Browser-like headers for the request.
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}
request = urllib.request.Request(url, headers=hdr)
html = urllib.request.urlopen(request).read()
soup = BeautifulSoup(html, 'html.parser')
# The square footage sits in a span with this class.
foot = soup.find('span', class_="listing-detail-sqft-val")
foot.text.strip()
Output:
'7,604'
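One caveat with the snippet above: `soup.find` returns None when no element matches, so `foot.text` would raise an AttributeError if the page layout changes. The following is a minimal sketch of a guarded version that also converts the text to an integer. The `extract_sqft` name and the `SAMPLE_HTML` string are hypothetical stand-ins for illustration; only the class name comes from the answer.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the listing page; only the class name
# "listing-detail-sqft-val" is taken from the answer above.
SAMPLE_HTML = '<div><span class="listing-detail-sqft-val"> 7,604 </span></div>'

def extract_sqft(markup):
    """Return the square footage as an int, or None if it cannot be found."""
    soup = BeautifulSoup(markup, "html.parser")
    span = soup.find("span", class_="listing-detail-sqft-val")
    if span is None:
        # No matching element: report absence instead of raising.
        return None
    digits = span.get_text(strip=True).replace(",", "")
    return int(digits) if digits.isdigit() else None

print(extract_sqft(SAMPLE_HTML))
print(extract_sqft("<p>no listing here</p>"))
```

Returning None for a missing element keeps the caller in control of how to report the failure.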
Answer 1 (score: 1)
If the request fails, your simple_get() function returns None, so you should test the result before using it. This can be done like so:
from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup

def log_error(e):
    # Minimal stand-in for the logging helper the question's code calls
    # but does not show; it only prints the message.
    print(e)

def simple_get(url):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the content-type of response is some kind of HTML/XML, return the
    text content, otherwise return None.
    """
    try:
        with closing(get(url, stream=True)) as resp:
            if is_good_response(resp):
                return resp.content
            else:
                return None
    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None

def is_good_response(resp):
    """
    Returns True if the response seems to be HTML, False otherwise.
    """
    content_type = resp.headers['Content-Type'].lower()
    return (resp.status_code == 200
            and content_type is not None
            and content_type.find('html') > -1)

url = 'https://www.remax.com/realestatehomesforsale/25-montage-way-laguna-beach-ca-92651-gid100012499996.html'
raw_html = simple_get(url)
# Only parse when the request actually returned content.
if raw_html:
    html = BeautifulSoup(raw_html, 'html.parser')
    for i, li in enumerate(html.select('li')):
        print(i, li.text)
else:
    print(f"get failed for '{url}'")
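Separately from the None check, note that is_good_response indexes resp.headers['Content-Type'] directly, which raises a KeyError if the server sends no such header, and its `is not None` test can never fail after `.lower()`. A minimal sketch of a more tolerant variant follows. The SimpleNamespace objects are hypothetical stand-ins for a response; the function only relies on the `status_code` and `headers` attributes that a requests response also exposes.

```python
from types import SimpleNamespace

def is_good_response(resp):
    """Return True if the response looks like a successful HTML response."""
    # .get() with a default avoids a KeyError when Content-Type is missing.
    content_type = resp.headers.get("Content-Type", "").lower()
    return resp.status_code == 200 and "html" in content_type

# Hypothetical stand-in responses used only to exercise the function.
html_resp = SimpleNamespace(status_code=200,
                            headers={"Content-Type": "text/html; charset=UTF-8"})
bare_resp = SimpleNamespace(status_code=200, headers={})

print(is_good_response(html_resp))
print(is_good_response(bare_resp))
```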
To put it simply, the following will give you the same error message:
from bs4 import BeautifulSoup
html = BeautifulSoup(None, 'html.parser')
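The traceback in the question points at a `len(markup)` call inside the BeautifulSoup constructor, so the message can be reproduced without bs4 at all. The sketch below only uses the built-in `len`. The helper name `len_error_message` is hypothetical, and other bs4 versions may word their error differently.

```python
def len_error_message(value):
    """Return the TypeError message raised by len(value), or None if len() works."""
    try:
        len(value)
    except TypeError as exc:
        return str(exc)
    return None

# None has no length, which is what the constructor trips over in the traceback.
print(len_error_message(None))
# Bytes returned by a successful request have a length, so no error is reported.
print(len_error_message(b"<li>item</li>"))
```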