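I have some HTML that I'm trying to parse. It has almost no class identifiers, so there is very little for BeautifulSoup to latch onto. It looks a bit like this: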
<h3>I am an important section of the list</h3>
<ul>
<li><a href="commonStuff/newThing1">Important text in here</a></li>
<li><a href="commonStuff/newThing2">Differentmportant text in here</a></li>
...
</ul>
<h3>I am another section of the list but I am not important</h3>
<ul>
<li><a href="I look like I could be important">Cool looking info in here</a></li>
<li><a href="I look like I could be important">Cool looking info in here</a></li>
</ul>
I only care about the <a> tags that follow the <h3> element I'm interested in. My current approach is:
sections = part.select('h3')
for section in sections:
    if "I am an important section of the list" in section:
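The problem is that I don't know what to do after that, because at that point I'm holding the heading tag, and what I want comes after it. The only way I can see is to use some kind of get-children function, so I do this:
for body in section.next_siblings:
    for links in body.find_all("a"):
There are two things wrong with this, since the siblings are not the same as the original HTML soup I parsed earlier. How would you suggest getting to the href link and the text inside an <a> tag when it sits directly under the <h3> tag I care about?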
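The trouble seems to be that I want the content that comes directly after the <h3> tag. It would be nice if I could somehow split the document by the content between these tags.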
Answer 0 (score: 2)
Use next_sibling, without the plural, to get the first following sibling:
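res = []
# Keep only the <h3> tags whose text contains the heading we care about.
# The "s and" guard covers tags with no single string, where s may be None.
sections = part.find_all(
    'h3',
    string=lambda s: s and 'I am an important section of the list' in s)
for section in sections:
    # The first next_sibling is the newline after </h3>; the second is the <ul>.
    for item in section.next_sibling.next_sibling.find_all('a'):
        res.append(item.get('href'))
print(res)
>>>['commonStuff/newThing1', 'commonStuff/newThing2']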
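As a variation that does not depend on the whitespace between tags, find_next_sibling('ul') skips text nodes and returns the next sibling tag with that name. This is only a minimal sketch: the html string and the names wanted and links are made up for illustration, and it assumes every heading is followed by its own <ul> (if a heading had no list, the call would return a later section's <ul>).

```python
from bs4 import BeautifulSoup

# Illustrative markup modelled on the question; not the asker's real page.
html = """
<h3>I am an important section of the list</h3>
<ul>
<li><a href="commonStuff/newThing1">Important text in here</a></li>
<li><a href="commonStuff/newThing2">Other important text in here</a></li>
</ul>
<h3>I am another section of the list but I am not important</h3>
<ul>
<li><a href="notImportant">Cool looking info in here</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
wanted = "I am an important section of the list"

links = []
for heading in soup.find_all("h3"):
    if wanted not in heading.get_text():
        continue
    # Skips the newline text node and lands on the following <ul> tag.
    ul = heading.find_next_sibling("ul")
    if ul is None:
        continue
    for a in ul.find_all("a"):
        links.append((a.get("href"), a.get_text(strip=True)))

print(links)
```

Each entry in links pairs the href with the link text, which covers both pieces the question asks for.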
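Why next_sibling is needed twice:
If your HTML source had no newline after </h3>, a single next_sibling would be enough. BeautifulSoup parses that newline as a NavigableString, so it becomes the first sibling.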
In this first example there is a newline after </h3>:
from bs4 import BeautifulSoup

html = """
<h3>I am an important section of the list</h3>
<ul>
<li><a href="commonStuff/newThing1">Important text in here</a></li>
<li><a href="commonStuff/newThing2">Differentmportant text in here</a></li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')
sections = soup.find_all('h3')
for section in sections:
    # The sibling right after </h3> is the newline text node.
    print('next sibling : ', section.next_sibling)
    print(type(section.next_sibling))
Result:
next sibling :
<class 'bs4.element.NavigableString'>
In this one there is no newline after </h3>, so we get the tag we are looking for directly:
from bs4 import BeautifulSoup

html = """
<h3>I am an important section of the list</h3><ul>
<li><a href="commonStuff/newThing1">Important text in here</a></li>
<li><a href="commonStuff/newThing2">Differentmportant text in here</a></li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')
sections = soup.find_all('h3')
for section in sections:
    # With no newline after </h3>, the sibling is the <ul> tag itself.
    print('next sibling : ', section.next_sibling)
    print(type(section.next_sibling))
Result:
next sibling : <ul>
<li><a href="commonStuff/newThing1">Important text in here</a></li>
<li><a href="commonStuff/newThing2">Differentmportant text in here</a></li>
</ul>
<class 'bs4.element.Tag'>
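The question also asks about splitting the document by what sits between the <h3> tags. One possible way, sketched here under assumptions, is to walk next_siblings from each heading, skip the NavigableString nodes, and stop at the next <h3>. The helper links_by_heading and the sample markup are hypothetical, and the sketch assumes the headings and lists are siblings under the same parent.

```python
from bs4 import BeautifulSoup, Tag

# Illustrative markup modelled on the question; not the asker's real page.
html = """
<h3>I am an important section of the list</h3>
<ul>
<li><a href="commonStuff/newThing1">Important text in here</a></li>
<li><a href="commonStuff/newThing2">Other important text in here</a></li>
</ul>
<h3>I am another section of the list but I am not important</h3>
<ul>
<li><a href="notImportant">Cool looking info in here</a></li>
</ul>
"""


def links_by_heading(soup):
    """Map each <h3> text to the (href, text) pairs found before the next <h3>."""
    groups = {}
    for heading in soup.find_all("h3"):
        found = []
        for sibling in heading.next_siblings:
            if not isinstance(sibling, Tag):
                continue  # newline / text nodes carry no links
            if sibling.name == "h3":
                break  # the next section starts here
            for a in sibling.find_all("a"):
                found.append((a.get("href"), a.get_text(strip=True)))
        groups[heading.get_text(strip=True)] = found
    return groups


soup = BeautifulSoup(html, "html.parser")
groups = links_by_heading(soup)
for title, links in groups.items():
    print(title, links)
```

Looking up groups["I am an important section of the list"] then gives only the links from the section of interest.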