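I am trying to extract a link using BeautifulSoup4 in Python 3.4. The element has no identifying attributes such as id or class. However, each link is preceded by a static text string, like this: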
<h2>
"Precluding-Text:"
<a href="http://the-link-im-after.com">Varying Anchor Text</a>
</h2>
My end goal is to get the following output:
http://the-link-im-after.com/
Answer 0 (score: 2)
You can use that static text to locate the link:
soup.find(text="Precluding-Text:").find_next_sibling("a")["href"]
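Alternatively, you may need a partial text match, since an exact match has to equal the whole text node, including any surrounding whitespace or quote characters: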
soup.find(text=lambda text: text and "Precluding-Text:" in text).find_next_sibling("a")["href"]
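For reference, here is a self-contained sketch of the partial-match approach run against the snippet from the question. It assumes the built-in "html.parser" and uses string=, the name Beautiful Soup 4.4.0 and later give to the text= argument (on older versions, substitute text=). The None checks are an addition for safety and are not part of the one-liners above.

```python
from bs4 import BeautifulSoup

html = """
<h2>
"Precluding-Text:"
<a href="http://the-link-im-after.com">Varying Anchor Text</a>
</h2>
"""

soup = BeautifulSoup(html, "html.parser")

# An exact match is compared against the whole text node. Here that node also
# holds the surrounding newlines and quote characters, so nothing is found.
exact = soup.find(string="Precluding-Text:")

# A substring match tolerates the extra whitespace and quotes.
label = soup.find(string=lambda s: s and "Precluding-Text:" in s)

# Guard each step so a missing label or link yields None instead of an error.
link = label.find_next_sibling("a") if label is not None else None
href = link["href"] if link is not None else None
print(href)
```

If the label text in your real page has no quotes or padding around it, the exact match may be enough; the substring version is simply the more forgiving of the two.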
Answer 1 (score: 0)
Another solution, using Python generators:
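from bs4 import BeautifulSoup as soup
import re

html = """
<h2>
"Precluding-Text:"
<a href="http://the-link-im-after.com">Varying Anchor Text</a>
</h2>
"""

# Name the parser explicitly to avoid the "no parser specified" warning.
s = soup(html, "html.parser")
# Find every text node that contains the static label.
elements = s.find_all(text=re.compile("Precluding-Text:"))
if not elements:
    print("not found")
else:
    for elem in elements:
        # next_siblings is a generator over the nodes that follow the text node.
        gen = elem.next_siblings
        a_tag = next(gen, None)
        # Only tags carry attributes, so skip a missing or non-<a> sibling.
        if getattr(a_tag, "name", None) == "a" and a_tag.get("href") is not None:
            print(a_tag.get("href"))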
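The generator idea can also be wrapped in a small reusable function that handles several matches and skips any nodes sitting between the label and the link. This is only a sketch: iter_links_after is a hypothetical helper name, the example.com URLs and the extra <b> tag are made-up test data, and string= assumes Beautiful Soup 4.4.0 or later.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def iter_links_after(soup, label):
    """Yield the href of the first <a> sibling after each text node containing label."""
    for text_node in soup.find_all(string=lambda s: s and label in s):
        for sibling in text_node.next_siblings:
            # Skip plain strings and tags other than <a>.
            if isinstance(sibling, Tag) and sibling.name == "a":
                href = sibling.get("href")
                if href is not None:
                    yield href
                # Stop at the first <a>, whether or not it had an href.
                break


# Hypothetical markup: two labelled links (one with a tag in between)
# and one link under a different label.
html = """
<h2>"Precluding-Text:" <a href="http://example.com/one">One</a></h2>
<h2>"Precluding-Text:" <b>note</b> <a href="http://example.com/two">Two</a></h2>
<h2>"Other-Text:" <a href="http://example.com/three">Three</a></h2>
"""

soup = BeautifulSoup(html, "html.parser")
links = list(iter_links_after(soup, "Precluding-Text:"))
print(links)
```

Because the function is a generator, you can also stop after the first hit with next(iter_links_after(soup, "Precluding-Text:"), None) instead of building the whole list.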