Getting links from HTML after a certain heading

Time: 2016-05-25 12:46:06

Tags: python html beautifulsoup

Consider this HTML code:

<html>
    <body>
        <p class="fixedfonts">
            <a href="A.pdf">LINK1</a>
        </p>

        <h2>Results</h2>

        <p class="fixedfonts">
            <a href="B.pdf">LINK2</a>
        </p>

        <p class="fixedfonts">
            <a href="C.pdf">LINK3</a>
        </p>
    </body>
</html>

It contains 3 links. However, I only need to retrieve the links that come after the heading Results.

I am using Python with BeautifulSoup:

from bs4 import BeautifulSoup, SoupStrainer

# at this point html contains the code as string

# parse the HTML file
soup = BeautifulSoup(html.replace('\n', ''), parse_only=SoupStrainer('a'))

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()  # rip it out

links = list()
for link in soup:
    if link.has_attr('href'):
        links.append(link['href'].replace('%20', ' '))

print(links)

With the code above I get every link in the document, but as I said, I only need the ones located after the Results tag/heading.

Any guidance is appreciated.

3 answers:

Answer 0 (score: 1):

You can solve this with the find_all_next() method:

results = soup.find("h2", text="Results")
for link in results.find_all_next("a"):
    print(link.get("href"))

Demo:

>>> from bs4 import BeautifulSoup
>>> 
>>> data = """
... <html>
...     <body>
...         <p class="fixedfonts">
...             <a href="A.pdf">LINK1</a>
...         </p>
... 
...         <h2>Results</h2>
... 
...         <p class="fixedfonts">
...             <a href="B.pdf">LINK2</a>
...         </p>
... 
...         <p class="fixedfonts">
...             <a href="C.pdf">LINK3</a>
...         </p>
...     </body>
... </html>"""
>>> 
>>> soup = BeautifulSoup(data, "html.parser")
>>> results = soup.find("h2", text="Results")
>>> for link in results.find_all_next("a"):
...     print(link.get("href"))
... 
B.pdf
C.pdf
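
As a minimal sketch of the same idea, the lookup can be wrapped in a function that returns a list, as the question's code does. The function name `links_after_heading` and the `sample_html` variable are hypothetical names chosen for illustration. The sketch assumes a reasonably recent bs4, where `string=` is the current spelling of the older `text=` argument, and it returns an empty list when the heading is not found instead of raising an AttributeError on None.

```python
from bs4 import BeautifulSoup


def links_after_heading(html, heading_text, heading_tag="h2"):
    """Return the href of every <a> that appears after the given heading."""
    soup = BeautifulSoup(html, "html.parser")
    # `string=` matches a tag whose only content is exactly this text
    heading = soup.find(heading_tag, string=heading_text)
    if heading is None:
        # Heading not present: nothing to collect
        return []
    # href=True keeps only anchors that actually carry an href attribute
    return [a["href"] for a in heading.find_all_next("a", href=True)]


# Same markup as in the question (illustrative variable name)
sample_html = """
<html>
    <body>
        <p class="fixedfonts">
            <a href="A.pdf">LINK1</a>
        </p>

        <h2>Results</h2>

        <p class="fixedfonts">
            <a href="B.pdf">LINK2</a>
        </p>

        <p class="fixedfonts">
            <a href="C.pdf">LINK3</a>
        </p>
    </body>
</html>"""

print(links_after_heading(sample_html, "Results"))
```

Note that find_all_next() walks everything that follows the heading in document order, so links after any later heading would be included as well.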

Answer 1 (score: 0):

Split the HTML data into two parts, before and after "Results", and then process only the latter:

data = html.split("Results")
need = data[1]

So do this:

from bs4 import BeautifulSoup, SoupStrainer
data = html.split("Results")
need = data[1]
soup = BeautifulSoup(need.replace('\n', ''), parse_only=SoupStrainer('a'))
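
The snippet above stops after building the soup, so here is one possible way to finish it, offered as a sketch rather than the answerer's exact method. It uses str.partition() on the full heading markup instead of split("Results"), because split() cuts at every occurrence of the word, so data[1] would be truncated if "Results" appeared again later in the page. The names `links_after_marker` and `sample_html`, and the assumption that the heading is written exactly as `<h2>Results</h2>`, are illustrative.

```python
from bs4 import BeautifulSoup, SoupStrainer


def links_after_marker(html, marker="<h2>Results</h2>"):
    """Return hrefs of <a> tags found in the text after the first `marker`."""
    # partition() cuts at the first occurrence only;
    # `found` is an empty string when the marker is absent
    _, found, tail = html.partition(marker)
    if not found:
        return []
    # Parse only the <a> tags of the remaining fragment
    soup = BeautifulSoup(tail, "html.parser", parse_only=SoupStrainer("a"))
    return [a["href"] for a in soup.find_all("a", href=True)]


# Same markup as in the question (illustrative variable name)
sample_html = """
<html>
    <body>
        <p class="fixedfonts">
            <a href="A.pdf">LINK1</a>
        </p>

        <h2>Results</h2>

        <p class="fixedfonts">
            <a href="B.pdf">LINK2</a>
        </p>

        <p class="fixedfonts">
            <a href="C.pdf">LINK3</a>
        </p>
    </body>
</html>"""

print(links_after_marker(sample_html))
```

String splitting is fragile: it fails if the heading carries attributes or different whitespace, which is why locating the tag in the parsed tree is usually the safer choice.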

Answer 2 (score: -1):

Tested, and it seems to work.
