I have the following HTML inside a larger document:
<br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br />
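I'm currently using BeautifulSoup to get other elements in the HTML, but I haven't found a way to get the important lines of text between the <br /> tags. I can isolate and navigate to each <br /> element, but I can't work out how to get the text in between. Any help would be greatly appreciated. Thanks.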
Answer 0 (score: 22)
If you just want any text that sits between two <br /> tags, you could do something like the following:
from BeautifulSoup import BeautifulSoup, NavigableString, Tag
input = '''<br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br />'''
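soup = BeautifulSoup(input)

for br in soup.findAll('br'):
    # The node right after this <br /> must be a text node.
    next_s = br.nextSibling
    if not (next_s and isinstance(next_s, NavigableString)):
        continue
    # The node after the text must be another <br />.
    next2_s = next_s.nextSibling
    if next2_s and isinstance(next2_s, Tag) and next2_s.name == 'br':
        text = str(next_s).strip()
        # Skip whitespace-only text between adjacent <br /> tags.
        if text:
            print "Found:", next_s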
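But perhaps I've misunderstood your question? Your description of the problem doesn't seem to match the "important" / "non important" labels in your example data, so I've gone with the description ;)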
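For readers on Python 3, the same sibling-walking idea can be sketched with the bs4 package. This is only a sketch under assumptions: the function name `text_between_brs`, the shortened `sample` string, and the choice of the `html.parser` backend are illustrative and not part of the answer above.

```python
from bs4 import BeautifulSoup, NavigableString, Tag


def text_between_brs(html):
    """Return stripped text nodes that sit directly between two <br> tags."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for br in soup.find_all("br"):
        text_node = br.next_sibling
        # Skip when the <br> is followed by another tag or by nothing at all.
        if not isinstance(text_node, NavigableString):
            continue
        closing = text_node.next_sibling
        if isinstance(closing, Tag) and closing.name == "br":
            text = text_node.strip()
            # Ignore whitespace-only text between adjacent <br> tags.
            if text:
                found.append(text)
    return found


# Shortened version of the sample HTML from the question.
sample = (
    "<br />\nImportant Text 1\n<br />\n<br />\n"
    "Not Important Text\n<br />\nImportant Text 2\n<br />"
)
print(text_between_brs(sample))
```

Like the original, this returns every non-empty text node bounded by two <br /> tags, so it does not try to tell "important" lines from the others.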
Answer 1 (score: 6)
For testing purposes, let's assume this block of HTML is inside a span tag:
x = """<span><br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br /></span>"""
Now I'll parse it and find my span tag:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(x)
y = soup.find('span')
If you iterate over the generator returned by y.childGenerator(), you get both the br tags and the text:
In [4]: for a in y.childGenerator(): print type(a), str(a)
....:
<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'>
Important Text 1
<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'>
<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'>
Not Important Text
<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'>
Important Text 2
<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'>
Important Text 3
<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'>
<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'>
Non Important Text
<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'>
Important Text 4
<type 'instance'> <br />
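In bs4 the same walk is usually written with the `.children` iterator. The sketch below keeps only the non-empty text nodes; the helper name `direct_text_children` and the shortened `html` string are assumptions made for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened version of the span-wrapped sample above.
html = (
    "<span><br />\nImportant Text 1\n<br />\n<br />\n"
    "Not Important Text\n<br /></span>"
)
span = BeautifulSoup(html, "html.parser").find("span")


def direct_text_children(tag):
    """Collect the non-empty text nodes that are direct children of `tag`."""
    texts = []
    for child in tag.children:
        # <br> tags are Tag objects; only text nodes are NavigableString.
        if isinstance(child, NavigableString):
            stripped = child.strip()
            if stripped:
                texts.append(stripped)
    return texts


print(direct_text_children(span))
```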
Answer 2 (score: 0)
The following worked for me:
for br in soup.findAll('br'):
    if str(type(br.contents[0])) == '<class \'BeautifulSoup.NavigableString\'>':
        print br.contents[0]
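Answer 3 (score: 0)
A slight improvement on Ken Kinder's answer: you can access the stripped_strings attribute of the BeautifulSoup element instead. For example, suppose your particular block of HTML is inside a span tag: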
x = """<span><br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br /></span>"""
First we parse x with BeautifulSoup, then look up the element (here, span), and then access its stripped_strings attribute, like so:
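from bs4 import BeautifulSoup
# Naming the parser explicitly avoids bs4's "no parser was explicitly specified" warning.
soup = BeautifulSoup(x, "html.parser")
span = soup.find("span")
text = list(span.stripped_strings)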
Now print(text) gives the following output:
['Important Text 1',
'Not Important Text',
'Important Text 2',
'Important Text 3',
'Non Important Text',
'Important Text 4']
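If only the "important" lines are wanted, the list can be filtered afterwards. The sketch below assumes, purely for illustration, that such lines can be recognised by the literal prefix `Important`, which holds for the sample data but may not hold for a real document; the `PREFIX` constant and the shortened `html` string are hypothetical.

```python
from bs4 import BeautifulSoup

# Shortened version of the span-wrapped sample above.
html = (
    "<span><br />\nImportant Text 1\n<br />\n<br />\n"
    "Not Important Text\n<br />\nImportant Text 2\n<br /></span>"
)
span = BeautifulSoup(html, "html.parser").find("span")

# Assumption: "important" lines are exactly those starting with this prefix.
PREFIX = "Important"
important = [s for s in span.stripped_strings if s.startswith(PREFIX)]
print(important)
```

In a real page the rule would more likely depend on position or surrounding markup, so the predicate in the list comprehension is the part to adapt.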