Question

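I have some HTML that I'm trying to parse. It has almost no class identifiers, so there is very little for BeautifulSoup to latch onto. It looks a bit like this: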

<h3>I am an important section of the list</h3>
<ul>
    <li><a href="commonStuff/newThing1">Important text in here</a></li>
    <li><a href="commonStuff/newThing2">Different important text in here</a></li>
    ...
</ul>
<h3>I am another section of the list but I am not important</h3>
<ul>
    <li><a href="I look like I could be important">Cool looking info in here</a></li>
    <li><a href="I look like I could be important">Cool looking info in here</a></li>
</ul>

I only care about the a tags that follow the h3 element I'm interested in. Naturally, my current approach is:

sections = part.select('h3')
for section in sections:
    if "I am an important section of the list" in section:
        ...  # unsure how to proceed from here

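The problem is that I don't know what to do next, because at that point I'm looking for whatever comes after the heading tag. The only way I've seen to do this is to use some kind of get-siblings function. So I do this: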

for body in section.next_siblings:

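There are two problems with this:

There should only be one sibling afterwards. I really don't know under what circumstances there would be more than one.

I can't do for links in body.find_all("a"):, because the siblings are not the same kind of object as the original HTML soup I parsed earlier.

How would you suggest getting to the href links, as well as the text inside the <a> tags, when they sit directly under the <h3> tag I care about?

The trouble here seems to be that I want the content directly after the <h3> tag. It would be nice if I could somehow split the document by the content between those tags.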

Answer 1

Use next_sibling (singular, not next_siblings) to get the first following sibling:

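res = []
# `part` is the already-parsed BeautifulSoup object from the question
sections = part.find_all(
    'h3',
    # s is None for an <h3> with nested tags, so guard before testing membership
    string=lambda s: s and 'I am an important section of the list' in s)
for section in sections:
    # The first next_sibling is the newline text node; the second is the <ul>
    for item in section.next_sibling.next_sibling.find_all('a'):
        res.append(item.get('href'))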

print(res)

>>>['commonStuff/newThing1', 'commonStuff/newThing2']
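Chaining next_sibling twice depends on exactly one whitespace node sitting between the <h3> and the <ul>. As a hedged alternative, the sketch below uses find_next_sibling('ul'), which skips text nodes. The helper name links_after_heading and the sample markup are illustrative assumptions, not part of the original answer.

```python
from bs4 import BeautifulSoup

html = """
<h3>I am an important section of the list</h3>
<ul>
    <li><a href="commonStuff/newThing1">Important text in here</a></li>
    <li><a href="commonStuff/newThing2">Different important text in here</a></li>
</ul>
<h3>I am another section of the list but I am not important</h3>
<ul>
    <li><a href="other/thing">Cool looking info in here</a></li>
</ul>
"""
part = BeautifulSoup(html, "html.parser")


def links_after_heading(soup, heading_text):
    """Return (href, text) pairs from the next <ul> sibling of each matching <h3>."""
    found = []
    for section in soup.find_all("h3", string=lambda s: s and heading_text in s):
        # find_next_sibling ignores whitespace text nodes between the tags
        ul = section.find_next_sibling("ul")
        if ul is None:
            continue
        for a in ul.find_all("a"):
            found.append((a.get("href"), a.get_text(strip=True)))
    return found


result = links_after_heading(part, "I am an important section of the list")
print(result)
```

Note that find_next_sibling('ul') returns the next <ul> at the same level even if other tags sit in between, so this assumes every heading of interest is followed by its own list.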

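An explanation of next_sibling:

If your HTML source has no newline after the <h3>, a single next_sibling is enough. When there is a newline, BeautifulSoup parses it as a NavigableString, and that string is the first sibling.

In the first example, there is a newline after the <h3>: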

from bs4 import BeautifulSoup

html = """
<h3>I am an important section of the list</h3>
<ul>
    <li><a href="commonStuff/newThing1">Important text in here</a></li>
    <li><a href="commonStuff/newThing2">Different important text in here</a></li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

sections = soup.find_all('h3')
for section in sections:
    print('next sibling : ', section.next_sibling)
    print(type(section.next_sibling))

Result:

next sibling :  

<class 'bs4.element.NavigableString'>
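This also explains both concerns in the question: next_siblings yields the whitespace text nodes as well as the tags, and a NavigableString does not offer the tag search methods. A minimal sketch, assuming the same kind of markup as above, that keeps only Tag siblings before calling find_all:

```python
from bs4 import BeautifulSoup, Tag

html = """
<h3>I am an important section of the list</h3>
<ul>
    <li><a href="commonStuff/newThing1">Important text in here</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
section = soup.find("h3")

# Record the type of every following sibling; newlines appear as text nodes
kinds = [type(sib).__name__ for sib in section.next_siblings]
print(kinds)

# Keep only real tags, which can be searched with find_all
tag_siblings = [sib for sib in section.next_siblings if isinstance(sib, Tag)]
hrefs = [a.get("href") for sib in tag_siblings for a in sib.find_all("a")]
print(hrefs)
```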

In the second example, there is no newline after the <h3>, so next_sibling returns the tag we are looking for directly:

from bs4 import BeautifulSoup

html = """
<h3>I am an important section of the list</h3><ul>
    <li><a href="commonStuff/newThing1">Important text in here</a></li>
    <li><a href="commonStuff/newThing2">Different important text in here</a></li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

sections = soup.find_all('h3')
for section in sections:
    print('next sibling : ', section.next_sibling)
    print(type(section.next_sibling))

Result:

next sibling :  <ul>
    <li><a href="commonStuff/newThing1">Important text in here</a></li>
    <li><a href="commonStuff/newThing2">Different important text in here</a></li>
</ul>
<class 'bs4.element.Tag'>
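The question also asks about splitting the document by the content between headings. One possible sketch, not taken from the original answer, walks the siblings of each <h3> until the next <h3> and collects the links found on the way. The function name links_by_heading is a hypothetical helper, and the markup is assumed to be flat, with headings and lists as siblings.

```python
from bs4 import BeautifulSoup, Tag

html = """
<h3>I am an important section of the list</h3>
<ul>
    <li><a href="commonStuff/newThing1">Important text in here</a></li>
    <li><a href="commonStuff/newThing2">Different important text in here</a></li>
</ul>
<h3>I am another section of the list but I am not important</h3>
<ul>
    <li><a href="I look like I could be important">Cool looking info in here</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")


def links_by_heading(soup):
    """Map each <h3> text to the (href, text) pairs found before the next <h3>."""
    grouped = {}
    for heading in soup.find_all("h3"):
        links = []
        for sib in heading.next_siblings:
            if not isinstance(sib, Tag):
                continue  # skip whitespace text nodes
            if sib.name == "h3":
                break  # the next section starts here
            links.extend(
                (a.get("href"), a.get_text(strip=True)) for a in sib.find_all("a")
            )
        grouped[heading.get_text(strip=True)] = links
    return grouped


grouped = links_by_heading(soup)
important = grouped["I am an important section of the list"]
print(important)
```

Looking up the heading of interest in the resulting dictionary then gives both the href and the link text for that section only.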
