Given an HTML page source, for example:

<html>
<head></head>
<body>
<p><nobr><a href="...">Some link text</a></nobr></p>
</body>
</html>

where it is not known in advance which tags wrap the <a> element (it could be anything, not just <nobr>). How can I create a loop that keeps unwrapping the parents of a given <a> tag until its parent is a paragraph?
Something like this:
import urllib3
from bs4 import BeautifulSoup as bs

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
http = urllib3.PoolManager()

page = "https://www.snopes.com/fact-check/rebuilding-iraq/"
link = "http://www.elca.org/ScriptLib/OS/Congregations/cdsDetail.asp?congrno=12962"

r = http.request('GET', page)
soup = bs(r.data, 'lxml')

a = soup.find('a', href=link)
while True:
    if a.parent.name == "p":
        break
    else:
        a.parent.name.unwrap()  # doesn't work, since .name is a string
print(soup)
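For what it's worth, the loop in the question is close: unwrap() just has to be called on the parent Tag object itself, not on its .name attribute (which is a plain string). A minimal sketch against a stand-alone snippet (the HTML and names here are illustrative, not from the Snopes page):

```python
from bs4 import BeautifulSoup

html = '<html><body><p><nobr><b><a href="x">link</a></b></nobr></p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

a = soup.find('a')
# unwrap() removes a tag but keeps its children; call it on the Tag, not its name
while a.parent is not None and a.parent.name != 'p':
    a.parent.unwrap()

print(a.parent.name)  # a.parent is now the <p> tag
```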
Answer 0 (score: 2)
Use find_parents on the given child tag.
import requests
from bs4 import BeautifulSoup

page = "https://www.snopes.com/fact-check/rebuilding-iraq/"
link = "http://www.elca.org/ScriptLib/OS/Congregations/cdsDetail.asp?congrno=12962"

r = requests.get(page)
soup = BeautifulSoup(r.content, 'lxml')

a = soup.find('a', href=link)
for tag in a.find_parents('p'):
    print(tag)
<p><font class="copyright_text_color" color="" face=""><b>Origins:</b></font> This item is “true” in the sense that Eric Rydbom is indeed an engineer stationed in Iraq with the Army’s <nobr>4th Infantry</nobr> Division, and he sends monthly <nobr>e-mail</nobr> dispatches such as the one quoted above to fellow members of his congregation at the <nobr><a href="http://www.elca.org/ScriptLib/OS/Congregations/cdsDetail.asp?congrno=12962" target="_blank">First Lutheran</a></nobr> Church of Richmond Beach in Shorline, Washington. This piece was one of those messages, forwarded to the church’s prayer chain and thence to the larger world via the Internet.</p>
If you want just the text, use:

for tag in a.find_parents('p'):
    print(tag.text)
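On a self-contained snippet (illustrative HTML, not the Snopes page), find_parents yields every matching ancestor from the innermost outward, while find_parent (singular) returns only the nearest match:

```python
from bs4 import BeautifulSoup

html = '<p id="target"><nobr><b><a href="x">link</a></b></nobr></p>'
soup = BeautifulSoup(html, 'html.parser')
a = soup.find('a', href='x')

# find_parents('p') yields every <p> ancestor (here just one),
# no matter how many tags sit between the <a> and the <p>
for tag in a.find_parents('p'):
    print(tag['id'])

# find_parent('p') returns only the closest <p> ancestor
print(a.find_parent('p')['id'])
```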
Answer 1 (score: 0)
An easy way with bs4 4.7.1+ is to use :has together with an attribute=value selector. No loop required.
import requests
from bs4 import BeautifulSoup as bs

page = "https://www.snopes.com/fact-check/rebuilding-iraq/"
link = "http://www.elca.org/ScriptLib/OS/Congregations/cdsDetail.asp?congrno=12962"

r = requests.get(page)
soup = bs(r.content, 'lxml')

print(soup.select_one('p:has([href="' + link + '"])'))
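The same selector on a small stand-alone document (illustrative HTML; :has requires the soupsieve package, which is bundled with bs4 4.7+, and works with any parser, including the stdlib html.parser):

```python
from bs4 import BeautifulSoup

html = '''
<p id="first"><a href="http://example.com/a">one</a></p>
<p id="second"><nobr><a href="http://example.com/b">two</a></nobr></p>
'''
soup = BeautifulSoup(html, 'html.parser')

# :has matches the <p> containing a descendant with the given href,
# no matter how deeply the <a> is nested inside it
p = soup.select_one('p:has([href="http://example.com/b"])')
print(p['id'])
```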