The HTML I am trying to scrape:
<div id="unitType">
<h2>BB100 <br>v1.4.3</h2>
</div>
I get the contents of the h2 tag with the following code:
import urllib
from bs4 import BeautifulSoup

initialPage = BeautifulSoup(urllib.urlopen(url).read(), 'html.parser')
deviceInfo = initialPage.find('div', {'id': 'unitType'}).h2.contents
print('Device Info: ', deviceInfo)
for i in deviceInfo:
    print i
Which outputs:
('Device Info: ', [u'BB100 ', <br>v1.4.3</br>])
BB100
<br>v1.4.3</br>
How can I remove the <h2>, </h2>, <br> and </br> HTML tags using BeautifulSoup rather than a regular expression? I have already tried i.decompose() and i.strip(), but neither worked: it throws 'NoneType' object is not callable.
Answer 0 (score: 2)
Just use find and extract the br tag:
In [15]: from bs4 import BeautifulSoup
...:
...: h = """<div id='unitType'><h2>BB10<br>v1.4.3</h2></div>"""
...:
...: soup = BeautifulSoup(h, "html.parser")
...:
...: h2 = soup.find(id="unitType").h2
...: h2.find("br").extract()
...: print(h2)
...:
<h2>BB10</h2>
Or use replace_with to replace the tag with just its text:
In [16]: from bs4 import BeautifulSoup
...:
...: h = """<div id='unitType'><h2><br>v1.4.3 BB10</h2></div>"""
...:
...: soup = BeautifulSoup(h, "html.parser")
...:
...: h2 = soup.find(id="unitType").h2
...:
...: br = h2.find("br")
...: br.replace_with(br.text)
...: print(h2)
...:
<h2>v1.4.3 BB10</h2>
Remove the h2 and keep the text:
In [37]: h = """<div id='unitType'><h2><br>v1.4.3 BB10</h2></div>"""
...:
...: soup = BeautifulSoup(h, "html.parser")
...:
...: unit = soup.find(id="unitType")
...:
...: h2 = unit.find("h2")
...: h2.replace_with(h2.text)
...: print(unit)
...:
<div id="unitType">v1.4.3 BB10</div>
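As an alternative to replace_with, bs4 also provides Tag.unwrap(), which removes a tag but leaves its children in place. The sketch below is not part of the original answer. The sample markup is assumed for illustration, and how the <br> itself is rendered afterwards depends on the bs4 version and parser in use.

```python
from bs4 import BeautifulSoup

# Sample markup assumed for illustration (same shape as the question's HTML).
html = "<div id='unitType'><h2>BB100 <br>v1.4.3</h2></div>"
soup = BeautifulSoup(html, "html.parser")

unit = soup.find(id="unitType")
# unwrap() removes the <h2> tag itself but keeps its children inside the div.
unit.h2.unwrap()

has_h2 = unit.find("h2") is not None
text = unit.get_text()
print(has_h2)
print(text)
```

Unlike replace_with(h2.text), this keeps any child tags of the h2 rather than flattening them to a string.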
If you only want "v1.4.3" and "BB10", there are many ways to do it:
In [60]: h = """<div id="unitType">
...: <h2>BB100 <br>v1.4.3</h2>
...: </div>"""
...:
...: soup = BeautifulSoup(h, "html.parser")
...:
...: h2 = soup.find(id="unitType").h2
...: # just find all strings
...: a, b = h2.find_all(text=True)
...: print(a, b)
...: # get the br
...: br = h2.find("br")
...: # get br text and just the h2 text ignoring any text from children
...: a, b = h2.find(text=True, recursive=False), br.text
...: print(a, b)
...:
BB100 v1.4.3
BB100 v1.4.3
br.text works here because, as the output in the question shows, the parser placed v1.4.3 inside the br tag.
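If only the text fragments are needed, one approach that does not depend on where the parser put the text is to iterate over stripped_strings or call get_text(). This is a minimal sketch rather than the answerer's own code; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """<div id="unitType">
<h2>BB100 <br>v1.4.3</h2>
</div>"""
soup = BeautifulSoup(html, "html.parser")
h2 = soup.find(id="unitType").h2

# stripped_strings yields each text fragment under h2, whitespace-trimmed,
# whether the parser nested the text inside <br> or left it as a sibling.
parts = list(h2.stripped_strings)
print(parts)

# get_text() joins the same fragments into one string with a chosen separator.
joined = h2.get_text(" ", strip=True)
print(joined)
```

Both calls return plain strings, so no tag has to be removed or replaced in the tree.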
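Answer 1 (score: 0)
You can check whether the element is a <br> tag with if i.name == 'br', and then just change the list to get its contents.
If you need to iterate more than once, modify the list.
for i in deviceInfo:
    if i.name == 'br':
        i = i.contents  # rebinds i to the br tag's children; deviceInfo itself is unchanged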
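The idea of modifying the list could be sketched as follows, building a new list of plain strings instead of rebinding the loop variable. The names device_info and cleaned are hypothetical, and the isinstance check is a substitution for the i.name test above, chosen because it does not rely on strings having a name attribute.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<div id='unitType'><h2>BB100 <br>v1.4.3</h2></div>"
soup = BeautifulSoup(html, "html.parser")
device_info = soup.find("div", {"id": "unitType"}).h2.contents

cleaned = []
for node in device_info:
    if isinstance(node, Tag):
        # A tag (here the <br>): keep only the text it contains, if any.
        text = node.get_text(strip=True)
    else:
        # A NavigableString: it is already plain text.
        text = node.strip()
    if text:
        cleaned.append(text)

print(cleaned)
```

This also shows why i.strip() failed in the question: on a Tag, bs4 treats an unknown attribute such as strip as a search for a child tag of that name, which returns None, and calling None raises 'NoneType' object is not callable.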