Web scraping in Python 3. I'm trying to scrape the following HTML page using the BeautifulSoup library in Python 3:
<div class="container">
  <div item-id="1" class="container1 item details">
    <div class="item-name">
      <div class="item-business-name">
        <h3>My Business #1</h3>
      </div>
    </div>
    <div class="item-location">
      <div class="item-address">
        <p>My Address #1</p>
      </div>
    </div>
    <div class="item-contact">
      <div class="item-email">
        <p>My Email #1</p>
      </div>
    </div>
  </div>
  <div item-id="2" class="container2 item details">
    <div class="item-name">
      <div class="item-business-name">
        <h3>My Business #2</h3>
      </div>
    </div>
    <div class="item-location">
      <div class="item-address">
        <p>My Address #2</p>
      </div>
    </div>
    <div class="item-contact">
      <div class="item-email">
        <p>My Email #2</p>
      </div>
    </div>
  </div>
</div>
Since the naming pattern of the containers (e.g. div item-id="1" class="container1 item details") changes with every item, I'd appreciate some advice on how to scrape the item-business-name, item-address, and item-email for all of the items. There are 100 items on the page, so the last container is div item-id="100" class="container100 item details".
What is the best way to get all of the items into a list at the "container1-100" level? From there I know how to get each one :) I was thinking of something like:
n = 1
while n < 101:
    item = soup.find_all(class_=f"container{n} item details")
    print(item)
    n = n + 1
Answer 0 (score: 0)
You don't have to use any numbers in the id or class names. Just use the zip() function to tie all of a contact's information together. The variable txt contains the HTML string from your question:
from bs4 import BeautifulSoup

soup = BeautifulSoup(txt, 'html.parser')

out = []
for name, address, email in zip(soup.select('.item-business-name'),
                                soup.select('.item-address'),
                                soup.select('.item-email')):
    out.append([name.get_text(strip=True), address.get_text(strip=True), email.get_text(strip=True)])

# pretty print all rows:
for no, row in enumerate(out, 1):
    print('{}.'.format(no), ('{:<20}'*3).format(*row))
Prints:
1. My Business #1      My Address #1       My Email #1
2. My Business #2      My Address #2       My Email #2
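If you would rather work at the container level, as in your original loop, a variant of the same idea is to iterate over the containers themselves via the shared item class and pull the three fields out of each one. That way the name, address, and email of a record stay grouped per container and cannot be paired up wrongly if one of the fields is missing on some item. This is only a sketch along the same lines as the code above, again assuming txt holds the HTML from your question; the variable names are arbitrary:

from bs4 import BeautifulSoup

soup = BeautifulSoup(txt, 'html.parser')

rows = []
# 'div.item' matches every numbered container (container1 ... container100),
# so there is no need to build the class name from a counter.
for container in soup.select('div.item'):
    name = container.select_one('.item-business-name')
    address = container.select_one('.item-address')
    email = container.select_one('.item-email')
    rows.append([
        name.get_text(strip=True) if name else '',
        address.get_text(strip=True) if address else '',
        email.get_text(strip=True) if email else '',
    ])

for no, row in enumerate(rows, 1):
    print('{}.'.format(no), ('{:<20}'*3).format(*row))

For the HTML in the question both approaches print the same table; the per-container version is just a bit more defensive about incomplete items.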