I have already parsed an HTML page using BeautifulSoup:
badges = soup.body.find('div', attrs={'class': 'col-md-11'})
After this, my badges object looks like this:
<div class="col-md-11">
<h4>
<span class="fas fa-user-circle padding-right-sm text-green"></span><span class="label label-success">Avocat definitiv</span>
<font style="font-weight:bold;">NEDELCU Paul-Iulian</font>, Baroul Dolj
<span style="color:green;font-weight:bold;"> [activ]</span>
</h4>
<p>
<span class="fas fa-map-marker text-red padding-right-sm"></span>Sediu principal în Baroul Dolj, adresă: mun.Craiova, str.Mihail kogălniceanu, nr.16, jud.Dolj, tel.
</p>
<p>
<span class="padding-right-md text-primary"><span class="fal fa-phone text-primary padding-right-sm"></span></span>
<span class="text-nowrap"><span class="fal fa-envelope text-info padding-right-sm"></span>paul_iulyan@yahoo.com</span>
</p>
</div>
Now I want to extract NEDELCU Paul-Iulian, Baroul Dolj, [activ], Sediu principal în Baroul Dolj, adresă: mun.Craiova, str.Mihail kogălniceanu, nr.16, jud.Dolj, tel. and paul_iulyan@yahoo.com.
I tried using badges.span.span, but that does not work.
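Answer 0 (score: 2)
Use soup.find to locate each element by its tag and attributes, then read its text or its next sibling.
Demo: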
from bs4 import BeautifulSoup
s = """<div class="col-md-11">
<h4>
<span class="fas fa-user-circle padding-right-sm text-green"></span><span class="label label-success">Avocat definitiv</span>
<font style="font-weight:bold;">NEDELCU Paul-Iulian</font>, Baroul Dolj
<span style="color:green;font-weight:bold;"> [activ]</span>
</h4>
<p>
<span class="fas fa-map-marker text-red padding-right-sm"></span>Sediu principal în Baroul Dolj, adresă: mun.Craiova, str.Mihail kogălniceanu, nr.16, jud.Dolj, tel.
</p>
<p>
<span class="padding-right-md text-primary"><span class="fal fa-phone text-primary padding-right-sm"></span></span>
<span class="text-nowrap"><span class="fal fa-envelope text-info padding-right-sm"></span>paul_iulyan@yahoo.com</span>
</p>
</div>"""
soup = BeautifulSoup(s, "html.parser")
val = soup.find("font", {"style":"font-weight:bold;"})
print( "{} {}".format(val.text, val.next_sibling ).strip() )
print( soup.find("span", {"style":"color:green;font-weight:bold;"}).text.strip() )
print( soup.find("span", class_="fas fa-map-marker text-red padding-right-sm").next_sibling.strip() )
print( soup.find("span", class_="text-nowrap").text.strip() )
Output:
NEDELCU Paul-Iulian , Baroul Dolj
[activ]
Sediu principal în Baroul Dolj, adresă: mun.Craiova, str.Mihail kogălniceanu, nr.16, jud.Dolj, tel.
paul_iulyan@yahoo.com
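As background on why the badges.span.span attempt from the question returns nothing: attribute-style access such as badges.span behaves like badges.find("span"), so it yields the first span descendant, which here is the empty icon span with no nested span. The sketch below illustrates this on a shortened copy of the markup (the trimmed HTML and the variable name first_span are assumptions made for the example, not part of the original page).

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's markup (assumption: trimmed for brevity).
html = """<div class="col-md-11"><h4>
<span class="fas fa-user-circle"></span><span class="label label-success">Avocat definitiv</span>
</h4></div>"""

badges = BeautifulSoup(html, "html.parser").find("div", class_="col-md-11")

# badges.span is shorthand for badges.find("span"): the first <span> descendant.
first_span = badges.span
print(first_span["class"])  # classes of the empty icon span

# The icon span has no children, so looking for a nested <span> finds nothing.
print(first_span.span)
```

Because the second lookup finds no nested span, chaining further attribute access on it would fail, which is why targeted find calls like the ones above are the safer route.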
Answer 1 (score: 1)
An optimized solution using a single select call on the badges object:
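result = []
# One CSS selector list picks all four elements, returned in document order.
for el in badges.select('h4 font, h4 span:nth-of-type(3), p:nth-of-type(1), p:nth-of-type(2) > span.text-nowrap'):
    if el.name == 'font':
        # The bar name is a bare text node right after the <font> tag.
        result.extend([el.text.strip(), el.next_sibling.strip()])
    else:
        result.append(el.text.strip())
print(result)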
Output (formatted):
['NEDELCU Paul-Iulian',
', Baroul Dolj',
'[activ]',
'Sediu principal în Baroul Dolj, adresă: mun.Craiova, str.Mihail kogălniceanu, nr.16, jud.Dolj, tel.',
'paul_iulyan@yahoo.com']
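If the exact selectors are not important, a looser alternative (not taken from either answer) is to walk every text fragment inside the div with the stripped_strings generator and filter afterwards. This is only a sketch on a shortened copy of the markup; the trimmed HTML and the variable name texts are assumptions made for the example.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's markup (assumption: trimmed for brevity).
html = """<div class="col-md-11">
<h4>
<span class="fas fa-user-circle"></span><span class="label label-success">Avocat definitiv</span>
<font style="font-weight:bold;">NEDELCU Paul-Iulian</font>, Baroul Dolj
<span style="color:green;font-weight:bold;"> [activ]</span>
</h4>
<p>
<span class="text-nowrap"><span class="fal fa-envelope"></span>paul_iulyan@yahoo.com</span>
</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")
badges = soup.find("div", class_="col-md-11")

# stripped_strings yields each non-empty text node with surrounding whitespace removed.
texts = list(badges.stripped_strings)
print(texts)
```

This also picks up the "Avocat definitiv" label, so it trades precision for brevity; the selector-based approach above is the better fit when only specific fields are wanted.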