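I'm working on a large web-scraping project in which each page has a different HTML structure. I want to scrape product descriptions from the pages, and I'm using the BeautifulSoup package.
For example, the product description I want to scrape is stored in HTML structures like these: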
<div class="product-description">
<p> "Title" </p>
<p> "Some content" </p>
<p> "Product description" </p>
</div>
<div class="product-description">
<p> "Title" </p>
<p> "Product description" </p>
</div>
<div class="product-description">
<p> "Title" </p>
<p> "Some content" </p>
<p> "Some content" </p>
<p> "Product description" </p>
</div>
<div class="product-description">
<p> "Title" </p>
<p> "Some-content" </p>
<p> "Some-content" </p>
<p> "Some-content" </p>
<p> "Product description" </p>
</div>
I wrote a for loop that pulls the data out of the div with class "product-description", depending on the page structure. My sample code snippet:
import grequests
from bs4 import BeautifulSoup

# urls is the list of product page URLs, defined elsewhere
requests = (grequests.get(url) for url in urls)
responses = grequests.imap(requests, size=1000)  # size sets the pool size
for response in responses:
    html_soup = BeautifulSoup(response.text, 'html.parser')
    if html_soup.find('div', class_='product-description').next_element.next_sibling.next_sibling.next_sibling.next_sibling:
        product_description = html_soup.find(
            'div', class_='product-description').next_element.next_sibling.next_sibling.next_sibling.next_sibling.text
    elif html_soup.find('div', class_='product-description').next_element.next_sibling.next_sibling.next_sibling:
        product_description = html_soup.find(
            'div', class_='product-description').next_element.next_sibling.next_sibling.next_sibling.text
    elif html_soup.find('div', class_='product-description').next_element.next_sibling.next_sibling:
        product_description = html_soup.find(
            'div', class_='product-description').next_element.next_sibling.next_sibling.text
    else:
        product_description = html_soup.find(
            'div', class_='product-description').next_element.next_sibling.text
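I expected each if condition to check whether a sibling exists at the current HTML level and, if it does not, to fall through to the next condition. However, after about 3000 iterations I get an AttributeError saying: 'NoneType' object has no attribute 'next_sibling'.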
I know there must be a simpler way to handle this kind of dynamic page structure. Any help would be greatly appreciated. Thanks in advance!
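For illustration, the error can be reproduced on a single small div. This is only a sketch, assuming a div that contains one <p> (the one-paragraph markup and the names first, kinds and error are made up for the example). It shows that the newline text nodes between tags count as siblings, and that chaining one hop past the last sibling calls .next_sibling on None, so the if test itself raises instead of evaluating to false.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumed markup: a product-description div with a single <p>.
HTML = '<div class="product-description">\n<p> "Title" </p>\n</div>'

soup = BeautifulSoup(HTML, "html.parser")
first = soup.find("div", class_="product-description").next_element

# Record what each hop along the sibling chain lands on.
kinds = []
node = first
while node is not None:
    kinds.append("tag" if isinstance(node, Tag) else "text")
    node = node.next_sibling

# The third hop returns None, so the fourth hop dereferences None.
try:
    first.next_sibling.next_sibling.next_sibling.next_sibling
    error = None
except AttributeError as exc:
    error = str(exc)

print(kinds)
print(error)
```

Only three nodes exist here (text, tag, text), so any chain of four .next_sibling hops fails before the condition can be tested.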
Answer 0 (score: 1)
Try this:
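for i in soup.find_all('div', class_="product-description"):
    try:
        # The description is the last <p> inside each div
        print(i.find_all('p')[-1].text)
    except IndexError:
        # Skip divs that contain no <p> elements
        pass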
Here, soup is the parsed form of this HTML:
<div class="product-description">
<p> "Title" </p>
<p> "Some content" </p>
<p> "Product description" </p>
</div>
<div class="product-description">
<p> "Title" </p>
<p> "Product description" </p>
</div>
<div class="product-description">
<p> "Title" </p>
<p> "Some content" </p>
<p> "Some content" </p>
<p> "Product description" </p>
</div>
<div class="product-description">
<p> "Title" </p>
<p> "Some-content" </p>
<p> "Some-content" </p>
<p> "Some-content" </p>
<p> "Product description" </p>
</div>
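The same idea can be packaged so that it runs on its own. This is a minimal sketch, assuming the description is always the last <p> inside each div, which holds for the samples above but may not hold for every page. The helper name last_paragraph_texts and the empty third div are hypothetical and were added for illustration. It returns None for a div with no <p> instead of skipping it silently.

```python
from bs4 import BeautifulSoup

# Sample markup mirroring the structures above, plus an empty div (assumed edge case).
HTML = """
<div class="product-description">
<p> "Title" </p>
<p> "Some content" </p>
<p> "Product description" </p>
</div>
<div class="product-description">
<p> "Title" </p>
<p> "Product description" </p>
</div>
<div class="product-description">
</div>
"""


def last_paragraph_texts(html):
    """Return the text of the last <p> in each product-description div."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for div in soup.find_all("div", class_="product-description"):
        paragraphs = div.find_all("p")
        # Use None when a div has no <p>, rather than raising IndexError.
        results.append(paragraphs[-1].get_text(strip=True) if paragraphs else None)
    return results


descriptions = last_paragraph_texts(HTML)
print(descriptions)
```

Indexing from the end with [-1] avoids counting siblings altogether, so the number of "Some content" paragraphs in between no longer matters.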