Consider this HTML code:
<html>
<body>
<p class="fixedfonts">
<a href="A.pdf">LINK1</a>
</p>
<h2>Results</h2>
<p class="fixedfonts">
<a href="B.pdf">LINK2</a>
</p>
<p class="fixedfonts">
<a href="C.pdf">LINK3</a>
</p>
</body>
</html>
It contains 3 links. However, I only need to retrieve the links that appear after the Results heading.
I am using Python with BeautifulSoup:
from bs4 import BeautifulSoup, SoupStrainer

# at this point html contains the document as a string
# parse the HTML, keeping only the <a> tags
soup = BeautifulSoup(html.replace('\n', ''), "html.parser",
                     parse_only=SoupStrainer('a'))
# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()  # rip it out

links = list()
for link in soup:
    if link.has_attr('href'):
        links.append(link['href'].replace('%20', ' '))
print(links)
With the provided code I get all of the links in the document, but, as I said, I only need the ones located after the Results tag/heading.
Any guidance is appreciated.
Answer 0 (score: 1)
You can use the find_all_next() method:
results = soup.find("h2", text="Results")
for link in results.find_all_next("a"):
    print(link.get("href"))
Demo:
>>> from bs4 import BeautifulSoup
>>>
>>> data = """
... <html>
... <body>
... <p class="fixedfonts">
... <a href="A.pdf">LINK1</a>
... </p>
...
... <h2>Results</h2>
...
... <p class="fixedfonts">
... <a href="B.pdf">LINK2</a>
... </p>
...
... <p class="fixedfonts">
... <a href="C.pdf">LINK3</a>
... </p>
... </body>
... </html>"""
>>>
>>> soup = BeautifulSoup(data, "html.parser")
>>> results = soup.find("h2", text="Results")
>>> for link in results.find_all_next("a"):
...     print(link.get("href"))
...
B.pdf
C.pdf
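One caveat worth noting: find_all_next("a") matches every <a> that follows the heading anywhere in the rest of the document, which is exactly right here, but it would over-collect if more sections followed the Results one. A minimal sketch of a bounded variant, assuming the same data string as above, walks forward and stops at the next <h2>:

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, "html.parser")
heading = soup.find("h2", text="Results")

# walk forward through the document, stopping at the next <h2>
links = []
for element in heading.find_all_next():
    if element.name == "h2":  # next section starts; stop collecting
        break
    if element.name == "a" and element.has_attr("href"):
        links.append(element["href"])
print(links)  # ['B.pdf', 'C.pdf']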
Answer 1 (score: 0)
Split the html data into two parts, before and after "Results", and then work with the latter:
data = html.split("Results")
need = data[1]
So, putting it together:
from bs4 import BeautifulSoup, SoupStrainer

data = html.split("Results")
need = data[1]
soup = BeautifulSoup(need.replace('\n', ''), "html.parser",
                     parse_only=SoupStrainer('a'))
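Note that this is a plain string split, so it would also match "Results" appearing anywhere earlier in the markup (for example in body text or an attribute value). As a minimal sketch, assuming html holds the document shown in the question, the links can then be pulled from the strained soup:

links = [a['href'] for a in soup.find_all('a', href=True)]
print(links)  # expected: ['B.pdf', 'C.pdf']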
Answer 2 (score: -1)
Tested, and it seems to work:
.gwt-Flextable