I am trying to scrape review data from a Bed Bath & Beyond product page, but when I use requests and BeautifulSoup I can't get anything back. Why is that?
code:
import requests
from bs4 import BeautifulSoup

r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
soup.find_all('span', class_='BVRRReviewAbbreviatedText')
The return value is an empty list: []
Answer 0 (score: 0)
I used js2py because the materials object contains several keys (BVRRRatingSummarySourceID, BVRRSecondaryRatingSummarySourceID and BVRRSourceID), and pulling the HTML out of their values with a regular expression alone gets harder if you need all of it. The reviews are served by a Bazaarvoice script (the reviews.djs endpoint requested below), so they never appear in the HTML that requests downloads from the product page itself.
from bs4 import BeautifulSoup
import js2py
import re
import requests

r = requests.get('https://bedbathandbeyond.ugc.bazaarvoice.com/2009-en_us/1061083288/reviews.djs?format=embeddedhtml')

# Match the whole "var materials = {...}" assignment in the JavaScript response.
pattern = (r'var'
           r'\s+'
           r'materials'
           r'\s*=\s*'
           r'{"BVRRRatingSummarySourceID".*}')
js_materials = re.search(pattern, r.text).group()

# Evaluate the JavaScript and convert the materials object to a Python dict.
obj = js2py.eval_js(js_materials).to_dict()

html = obj['BVRRSourceID']
soup = BeautifulSoup(html, 'lxml')
spans = soup.select('span.BVRRReviewAbbreviatedText')

>>> len(spans)
5
In the example above I only used the HTML stored under the BVRRSourceID key, but you can work with all of the HTML by joining the values together:
html = ''.join(obj.values())
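A minimal follow-up sketch, assuming obj is the dict produced by to_dict() above and that every value in it is an HTML fragment; parsing the combined markup then works the same way as before:

# Combine every fragment in the materials object and parse it in one pass.
html = ''.join(obj.values())
soup = BeautifulSoup(html, 'lxml')
all_spans = soup.select('span.BVRRReviewAbbreviatedText')
print(len(all_spans))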
Don't forget to install js2py (pip install js2py) and, if you want to use the lxml parser, lxml as well (pip install lxml).
Answer 1 (score: -1)
You can use the Selenium webdriver to fetch the HTML content you are interested in. For example,
import time

from selenium import webdriver

def get_html(url):
    driver = webdriver.Chrome()
    driver.maximize_window()
    driver.get(url)
    # Give the page's JavaScript a few seconds to render the dynamic content.
    time.sleep(5)
    html_content = driver.page_source.strip()
    driver.quit()
    return html_content
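A usage sketch, assuming chromedriver is available on your PATH; extracting the review spans with BeautifulSoup afterwards is my addition, not part of the original answer:

from bs4 import BeautifulSoup

# Hypothetical usage: "url" is the product page URL from the question (not shown here).
html = get_html(url)
soup = BeautifulSoup(html, 'lxml')
reviews = soup.find_all('span', class_='BVRRReviewAbbreviatedText')
print(len(reviews))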