Consider this HTML code:
<html>
<body>
<p class="fixedfonts">
<a href="A.pdf">LINK1</a>
</p>
<h2>Results</h2>
<p class="fixedfonts">
<a href="B.pdf">LINK2</a>
</p>
<p class="fixedfonts">
<a href="C.pdf">LINK3</a>
</p>
</body>
</html>
It contains 3 links. However, I only need to retrieve the links that appear after the Results heading.
I am using Python with BeautifulSoup:
from bs4 import BeautifulSoup, SoupStrainer

# at this point html contains the document as a string
# parse the HTML, keeping only the <a> tags
soup = BeautifulSoup(html.replace('\n', ''), "html.parser",
                     parse_only=SoupStrainer('a'))
# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()  # rip it out

links = list()
for link in soup:
    if link.has_attr('href'):
        links.append(link['href'].replace('%20', ' '))
print(links)
With the provided code I get all of the links in the document, but, as I said, I only need the ones located after the Results tag/heading.
Any guidance is appreciated.
Answer 0 (score: 1)
You can use the find_all_next() method:
results = soup.find("h2", text="Results")
for link in results.find_all_next("a"):
    print(link.get("href"))
Demo:
>>> from bs4 import BeautifulSoup
>>>
>>> data = """
... <html>
... <body>
... <p class="fixedfonts">
... <a href="A.pdf">LINK1</a>
... </p>
...
... <h2>Results</h2>
...
... <p class="fixedfonts">
... <a href="B.pdf">LINK2</a>
... </p>
...
... <p class="fixedfonts">
... <a href="C.pdf">LINK3</a>
... </p>
... </body>
... </html>"""
>>>
>>> soup = BeautifulSoup(data, "html.parser")
>>> results = soup.find("h2", text="Results")
>>> for link in results.find_all_next("a"):
...     print(link.get("href"))
...
B.pdf
C.pdf
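One caveat worth noting: find_all_next("a") matches every <a> that follows the heading anywhere in the rest of the document, which is exactly right here, but it would over-collect if more sections followed the Results one. A minimal sketch of a bounded variant, assuming the same data string as above, walks forward and stops at the next <h2>:

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, "html.parser")
heading = soup.find("h2", text="Results")

# walk forward through the document, stopping at the next <h2>
links = []
for element in heading.find_all_next():
    if element.name == "h2":  # next section starts; stop collecting
        break
    if element.name == "a" and element.has_attr("href"):
        links.append(element["href"])
print(links)  # ['B.pdf', 'C.pdf']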
Answer 1 (score: 0)
Split the html data into two parts, before and after "Results", and then work with the latter:
data = html.split("Results")
need = data[1]
So, putting it together:
from bs4 import BeautifulSoup, SoupStrainer

data = html.split("Results")
need = data[1]
soup = BeautifulSoup(need.replace('\n', ''), "html.parser",
                     parse_only=SoupStrainer('a'))
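Note that this is a plain string split, so it would also match "Results" appearing anywhere earlier in the markup (for example in body text or an attribute value). As a minimal sketch, assuming html holds the document shown in the question, the links can then be pulled from the strained soup:

links = [a['href'] for a in soup.find_all('a', href=True)]
print(links)  # expected: ['B.pdf', 'C.pdf']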
Answer 2 (score: -1)
Tested, and it seems to work:
.gwt-Flextable