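I am trying to scrape data from Yellow Pages, but I only need the numbered plumbers. I can't get the numbered text inside the h2 with class="n". I can get the class="business-name" text, but I only want the numbered plumbers, without the advertisements. What am I doing wrong? Thank you very much.
The HTML: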
<div class="info">
<h2 class="n">1. <a class="business-name" href="/austin-tx/mip/johnny-rooter-11404675?lid=171372530" rel="" data-impressed="1"><span>Johnny Rooter</span></a></h2>
</div>
Here is my Python code:
import requests
from bs4 import BeautifulSoup as bs
url = "https://www.yellowpages.com/austin-tx/plumbers"
req = requests.get(url)
data = req.content
soup = bs(data, "lxml")
links = soup.findAll("div", {"class": "info"})
for link in links:
    for content in link.contents:
        try:
            print(content.find("h2", {"class": "n"}).text)
        except:
            pass
Answer 0 (score: 0)
You need a different class selector to restrict the search to that section of the page:
import requests
from bs4 import BeautifulSoup as bs
url = "https://www.yellowpages.com/austin-tx/plumbers"
req = requests.get(url)
data = req.content
soup = bs(data, "lxml")
links = [item.text.replace('\xa0','') for item in soup.select('.organic h2')]
print(links)
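.organic is a single class selector, taken from a compound class, that matches the parent element containing all the numbered plumbers. In the browser's element inspector, the highlighted region for that element starts after the ads.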
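To see why scoping the selector to the parent matters, here is a minimal offline sketch. The markup is hypothetical: it is modelled on the snippet in the question, and the `paid-listing` class name, the second listing, and the `search-results organic` compound class are assumptions for illustration, so the real page may differ. It uses the built-in `html.parser` so it runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an ad block followed by the numbered ("organic") block.
# Only the inner <div class="info"> structure comes from the question.
html = """
<div class="search-results paid-listing">
  <div class="info">
    <h2 class="n"><a class="business-name" href="#"><span>Sponsored Plumbing Co</span></a></h2>
  </div>
</div>
<div class="search-results organic">
  <div class="info">
    <h2 class="n">1. <a class="business-name" href="#"><span>Johnny Rooter</span></a></h2>
  </div>
  <div class="info">
    <h2 class="n">2. <a class="business-name" href="#"><span>Example Plumbing</span></a></h2>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# ".organic h2" matches only <h2> elements nested under an element whose
# class list contains "organic", so the ad block is skipped.
numbered = [h2.get_text().replace("\xa0", "").strip() for h2 in soup.select(".organic h2")]

# A bare "h2" selector would also pick up the ad heading.
all_headings = [h2.get_text().strip() for h2 in soup.select("h2")]

print(numbered)
print(all_headings)
```

A class selector such as `.organic` matches any element whose class attribute includes that token, which is why it works even when the element carries several classes.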