I only want the text outside the span, not anything inside the span. My current code gives me all of it:
birthday = bsObj.find( "div", {"class":"age"} )
# <div class="age"><span class="category">Age:</span> 23 (10/21/1992)</div>
birthday.get_text()
birthplace = bsObj.find( "div", {"class":"hometown"} )
# <div class="hometown"><span class="category">Birthplace:</span> Barranquilla, Colombia</div>
birthplace.get_text()
Result:
"Age: 24 (04/21/1991)","Birthplace: Barranquilla, Colombia"
Desired result:
"24 (04/21/1991)","Barranquilla, Colombia"
Answer 0 (score: 3)
Clear the span before calling get_text():
from bs4 import BeautifulSoup
html_doc ='<html><body><div class="age"><span class="category">Age:</span> 23 (10/21/1992)</div><div class="hometown"><span class="category">Birthplace:</span> Barranquilla, Colombia</div></body></html>'
bsObj = BeautifulSoup(html_doc, 'html.parser')
# <div class="age"><span class="category">Age:</span> 23 (10/21/1992)</div>
birthday = bsObj.find( "div", {"class":"age"} )
birthday.span.clear()
print(birthday.get_text()) # 23 (10/21/1992)
# <div class="hometown"><span class="category">Birthplace:</span> Barranquilla, Colombia</div>
birthplace = bsObj.find( "div", {"class":"hometown"} )
birthplace.span.clear()
print(birthplace.get_text()) # Barranquilla, Colombia
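Note that clear() modifies the parsed tree: the span's label text is gone afterwards. If the tree needs to stay intact, one alternative is to collect only the strings that are direct children of the div. This is a minimal sketch, assuming the same markup as in the question; the helper name own_text is hypothetical and not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, NavigableString

html_doc = ('<div class="age"><span class="category">Age:</span> 23 (10/21/1992)</div>'
            '<div class="hometown"><span class="category">Birthplace:</span> Barranquilla, Colombia</div>')
soup = BeautifulSoup(html_doc, 'html.parser')

def own_text(tag):
    """Join the strings that are direct children of tag, skipping nested tags."""
    # Hypothetical helper: nested tags such as <span> are Tag objects, not
    # NavigableString, so their text is left out without touching the tree.
    parts = [child for child in tag.children if isinstance(child, NavigableString)]
    return "".join(parts).strip()

age = own_text(soup.find("div", {"class": "age"}))
hometown = own_text(soup.find("div", {"class": "hometown"}))
print(age)
print(hometown)
```

Because nothing is removed, the span is still available afterwards, for example to read the "Age:" label.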
Answer 1 (score: 1)
Call clear() on the span, then call strip() on the result of get_text() to remove the leading whitespace:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<div class="age"><span class="category">Age:</span> 23 (10/21/1992)</div>', 'html.parser')
soup.span.clear()
print(soup.get_text().strip())
Output:
23 (10/21/1992)
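The separate strip() call can probably be folded into get_text() itself, which accepts a strip argument. A small sketch, assuming the same single-div input as above:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div class="age"><span class="category">Age:</span> 23 (10/21/1992)</div>', 'html.parser')
soup.span.clear()

# strip=True strips each text fragment individually and drops empty ones.
text = soup.get_text(strip=True)
print(text)
```

With strip=True the fragments are stripped one by one and then joined with the separator (an empty string by default), so when several fragments remain, whitespace between them is lost unless a separator is passed. For a single remaining fragment, as here, the result matches get_text().strip().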