Web scraping pages with a dynamic HTML structure

Time: 2020-04-17 03:05:22

Tags: html python-3.x web-scraping beautifulsoup

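I am working on a large web scraping project in which every page has a different HTML structure. I want to scrape the product description from each page, and I am using the BeautifulSoup package.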

For example, the product description I want to scrape is stored in HTML structures like these:

<div class="product-description">
  <p> "Title" </p>
  <p> "Some content" </p>
  <p> "Product description" </p>
</div>


<div class="product-description">
  <p> "Title" </p>
  <p> "Product description" </p>
</div>

<div class="product-description">
  <p> "Title" </p>
  <p> "Some content" </p>
  <p> "Some content" </p>
  <p> "Product description" </p>
</div>


<div class="product-description">
  <p> "Title" </p>
  <p> "Some-content" </p>
  <p> "Some-content" </p>
  <p> "Some-content" </p>
  <p> "Product description" </p>
</div>

I wrote a for loop that pulls the data from the div with class "product-description", depending on the page structure. My sample code snippet:

import grequests
from bs4 import BeautifulSoup

# urls is the list of product page URLs (defined elsewhere)
requests = (grequests.get(url) for url in urls)
responses = grequests.imap(requests, grequests.Pool(1000))

for response in responses:

    html_soup = BeautifulSoup(response.text, 'html.parser')

    # Each branch probes one sibling fewer than the branch before it
    if html_soup.find('div', class_='product-description').next_element.next_sibling.next_sibling.next_sibling.next_sibling:
        product_description = html_soup.find(
            'div', class_='product-description').next_element.next_sibling.next_sibling.next_sibling.next_sibling.text

    elif html_soup.find('div', class_='product-description').next_element.next_sibling.next_sibling.next_sibling:
        product_description = html_soup.find(
            'div', class_='product-description').next_element.next_sibling.next_sibling.next_sibling.text

    elif html_soup.find('div', class_='product-description').next_element.next_sibling.next_sibling:
        product_description = html_soup.find(
            'div', class_='product-description').next_element.next_sibling.next_sibling.text

    else:
        product_description = html_soup.find(
            'div', class_='product-description').next_element.next_sibling.text

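I expected each if condition to check whether a sibling exists at the current HTML level and, if not, to fall through to the next condition. However, after about 3000 iterations I get an AttributeError: 'NoneType' object has no attribute 'next_sibling'.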

I know there must be a simpler way to handle this kind of dynamic page structure. Any help would be greatly appreciated. Thanks in advance!
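
For readers wondering where the error comes from, here is a minimal sketch of the likely cause, assuming the pages are parsed with html.parser as in the snippet above. The html string is a made-up stand-in for one of the shorter structures. The whitespace between tags becomes text nodes, so siblings alternate between strings and tags, and the last node's next_sibling is None. Reading one more .next_sibling from that None raises the AttributeError before the if condition can evaluate to False.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page: the shortest of the structures in the question
html = """
<div class="product-description">
  <p> "Title" </p>
  <p> "Product description" </p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="product-description")

# Follow next_sibling from the div's first child until the chain ends
chain = []
node = div.next_element
while node is not None:
    chain.append(node)
    node = node.next_sibling

# Whitespace between the tags shows up as NavigableString nodes
kinds = [type(n).__name__ for n in chain]
print(kinds)

# The last node has no sibling, so next_sibling is None here
last_sibling = chain[-1].next_sibling
print(last_sibling)

# One more .next_sibling on that None raises instead of returning a falsy value
try:
    div.next_element.next_sibling.next_sibling.next_sibling.next_sibling.next_sibling.next_sibling
    error_message = ""
except AttributeError as exc:
    error_message = str(exc)
print(error_message)
```

The same kind of failure would occur one step earlier if a page has no matching div at all, because find() then returns None.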

1 answer:

Answer 0 (score: 1)

Try this:

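for i in soup.find_all('div', class_="product-description"):
    try:
        # The description is the last <p> inside the div
        print(i.find_all('p')[-1].text)
    except IndexError:
        # The div contains no <p> elements, so skip it
        pass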

Here, soup is the parsed form of the following HTML:

<div class="product-description">
  <p> "Title" </p>
  <p> "Some content" </p>
  <p> "Product description" </p>
</div>


<div class="product-description">
  <p> "Title" </p>
  <p> "Product description" </p>
</div>

<div class="product-description">
  <p> "Title" </p>
  <p> "Some content" </p>
  <p> "Some content" </p>
  <p> "Product description" </p>
</div>


<div class="product-description">
  <p> "Title" </p>
  <p> "Some-content" </p>
  <p> "Some-content" </p>
  <p> "Some-content" </p>
  <p> "Product description" </p>
</div>
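
To make the idea above easy to try without any network access, here is a self-contained sketch. The function name last_paragraph_texts and the shortened sample_html are illustrative assumptions, not part of the original answer. It takes the last p of every matching div and skips divs that contain no p, so it does not depend on how many siblings come before the description.

```python
from bs4 import BeautifulSoup


def last_paragraph_texts(html):
    """Return the text of the last <p> in each product-description div."""
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    for div in soup.find_all("div", class_="product-description"):
        paragraphs = div.find_all("p")
        if not paragraphs:
            # No <p> in this div, so there is nothing to extract
            continue
        # strip=True drops the surrounding whitespace from the text
        texts.append(paragraphs[-1].get_text(strip=True))
    return texts


# Hypothetical sample: two of the structures above plus an empty div
sample_html = """
<div class="product-description">
  <p> "Title" </p>
  <p> "Some content" </p>
  <p> "Product description" </p>
</div>
<div class="product-description">
  <p> "Title" </p>
  <p> "Product description" </p>
</div>
<div class="product-description"></div>
"""

result = last_paragraph_texts(sample_html)
print(result)
```

In the scraping loop from the question, the same function could be called as last_paragraph_texts(response.text). A page with no matching div simply yields an empty list instead of raising.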