Question

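I'm trying to get the titles from several h tags in some HTML elements. The h tags always have a number attached, e.g. h1, h14, h17. I know I can use .select("h1,h11,h9") to get them, but there are a lot of them. If they were something like class="heading1", class="heading2", class="heading3", I could have handled them with .select("[class^='heading']").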

How can I use a selector to get the content of all the different h tags?

My attempt

htmlelements="""
<h1>
    <a href="https://somesite.com/">SEC fight</a>
</h1>
<h11>
    <a href="https://somesite.com/">AFC fight</a>
</h11>
<h9>
    <a href="https://somesite.com/">UTY fight</a>
</h9>
"""

from bs4 import BeautifulSoup

page = BeautifulSoup(htmlelements, "lxml")
for item in page.select("h11"):
    print(item.text)

P.S. Regex, as in .find_all(string=re.compile("h")), cannot be used here.

Answer 1

One approach is to call .find_all() with all the possible h tag names:

htmlelements="""
<h1>
    <a href="https://somesite.com/">SEC fight</a>
</h1>
<h11>
    <a href="https://somesite.com/">AFC fight</a>
</h11>
<h9>
    <a href="https://somesite.com/">UTY fight</a>
</h9>
"""

from bs4 import BeautifulSoup

page = BeautifulSoup(htmlelements, "lxml")

# Build the list of candidate tag names: h1, h2, ..., h19
for item in page.find_all([f"h{h}" for h in range(1, 20)]):
    print(item.get_text(strip=True))

This prints:

SEC fight
AFC fight
UTY fight
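Since the question asks specifically about selectors, the same idea can be sketched with .select() by joining the candidate tag names into one comma-separated selector. This is only a sketch under assumptions: the upper bound of 19 mirrors the range used above and is a guess about how high the numbers go, and "html.parser" is used here instead of "lxml" only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

htmlelements = """
<h1>
    <a href="https://somesite.com/">SEC fight</a>
</h1>
<h11>
    <a href="https://somesite.com/">AFC fight</a>
</h11>
<h9>
    <a href="https://somesite.com/">UTY fight</a>
</h9>
<p>not a heading</p>
"""

# Build "h1,h2,...,h19"; the upper bound is an assumption, adjust as needed
selector = ",".join(f"h{n}" for n in range(1, 20))

page = BeautifulSoup(htmlelements, "html.parser")

# select() returns the matching tags in document order
titles = [item.get_text(strip=True) for item in page.select(selector)]

for title in titles:
    print(title)
```

As far as I know, CSS type selectors cannot match on a tag-name prefix the way [class^='heading'] matches on an attribute prefix, which is why the tag names are enumerated here.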
