Question

I am trying to scrape data from the following URL:

url = https://bedbathandbeyond.ugc.bazaarvoice.com/2009-en_us/1061083288/reviews.djs?format=embeddedhtml&page=4&scrollToTop=true

I got this URL from the Bed Bath & Beyond website, but when I use requests and BeautifulSoup on it, I can't get anything. Why is that?

Code:

import requests
from bs4 import BeautifulSoup

# url is the address given above
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
soup.find_all('span', class_='BVRRReviewAbbreviatedText')

The return value is an empty list: []

Answer 1

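I used js2py because the materials object contains several keys (BVRRRatingSummarySourceID, BVRRSecondaryRatingSummarySourceID and BVRRSourceID), and pulling the HTML out of their values with a regular expression alone gets harder if you need all of them.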

import re

import js2py
import requests
from bs4 import BeautifulSoup

r = requests.get('https://bedbathandbeyond.ugc.bazaarvoice.com/2009-en_us/1061083288/reviews.djs?format=embeddedhtml')

# Match the JavaScript statement that assigns the materials object.
pattern = (r'var'
           r'\s+'
           r'materials'
           r'\s*=\s*'
           r'{"BVRRRatingSummarySourceID".*}')

js_materials = re.search(pattern, r.text).group()
# Evaluate the JavaScript and convert the resulting object to a dict.
obj = js2py.eval_js(js_materials).to_dict()
# The reviews markup is stored as a string under this key.
html = obj['BVRRSourceID']
soup = BeautifulSoup(html, 'lxml')
spans = soup.select('span.BVRRReviewAbbreviatedText')

>>> len(spans)
5

In the example above I only used the HTML under the BVRRSourceID key, but you can use the whole HTML by joining the values together:

html = ''.join(obj.values())

If you want to use js2py and the lxml parser, don't forget to install them: pip install js2py and pip install lxml.
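
To make the idea concrete without hitting the network, here is a minimal offline sketch. It assumes the response is a JavaScript file in which the markup sits inside string literals with escaped quotes, which would explain why parsing r.content directly finds no matching spans. SAMPLE_JS is a made-up, shortened stand-in for the real response, extract_materials is a hypothetical helper name, and json.loads is swapped in for js2py on the assumption that the object literal happens to be valid JSON. If it is not, js2py as shown above remains the safer choice.

```python
import json
import re

# Hypothetical, heavily shortened stand-in for the body of reviews.djs.
# The real response is much larger; this only mimics its assumed shape.
SAMPLE_JS = (
    'var materials = {"BVRRRatingSummarySourceID": "<div>4.5 stars<\\/div>", '
    '"BVRRSourceID": "<span class=\\"BVRRReviewAbbreviatedText\\">'
    'Great towels<\\/span>"};'
)


def extract_materials(js_text):
    """Return the materials object as a dict of key -> HTML string."""
    match = re.search(r'var\s+materials\s*=\s*(\{.*\})', js_text, re.DOTALL)
    if match is None:
        raise ValueError('no "var materials = {...}" assignment found')
    # Assumption: the object literal is valid JSON (double-quoted keys/strings).
    return json.loads(match.group(1))


materials = extract_materials(SAMPLE_JS)
# Join every value to get the whole HTML, as suggested above.
html = ''.join(materials.values())
```

The resulting html string can then be passed to BeautifulSoup exactly as in the js2py version.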

Answer 2

You can use Selenium WebDriver to get the HTML content you are interested in. For example:

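import time

from selenium import webdriver


def get_html(url):
    driver = webdriver.Chrome()
    try:
        driver.maximize_window()
        driver.get(url)

        # Give the page's JavaScript a few seconds to render the reviews.
        time.sleep(5)
        return driver.page_source.strip()
    finally:
        # Close the browser even if loading the page fails.
        driver.quit()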
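
Once get_html has returned the rendered page, the review texts still have to be pulled out of it. The sketch below is one way to do that, assuming the reviews live in span elements with the class BVRRReviewAbbreviatedText as in the question. It uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages; ReviewTextParser and extract_review_texts are hypothetical names introduced here. With BeautifulSoup the equivalent would be soup.select('span.BVRRReviewAbbreviatedText').

```python
from html.parser import HTMLParser


class ReviewTextParser(HTMLParser):
    """Collect the text inside <span class="BVRRReviewAbbreviatedText">."""

    TARGET_CLASS = "BVRRReviewAbbreviatedText"

    def __init__(self):
        super().__init__()
        self.reviews = []   # finished review strings
        self._depth = 0     # span nesting depth inside a target span (0 = outside)
        self._chunks = []   # text pieces of the review being collected

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            # Nested span inside a review: track it so the end tags balance.
            self._depth += 1
        elif self.TARGET_CLASS in classes:
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.reviews.append("".join(self._chunks).strip())

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def extract_review_texts(html):
    """Return the text of every review span found in the given HTML string."""
    parser = ReviewTextParser()
    parser.feed(html)
    parser.close()
    return parser.reviews
```

A typical call would be extract_review_texts(get_html(url)), which returns a list with one string per review span.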
