content='<p>Hello, the web site is <a href="https://www.google.com">Google</a></p>. <p>The search engine is <a href="https://www.baidu.com">Baidu</a></p>.'
soup = BeautifulSoup(content, 'html.parser')
Now I want to replace each entire <a>...</a> element with the URL from its href, so the expected result is:
Hello, the web site is https://www.google.com. The search engine is https://www.baidu.com.
Can anyone suggest a solution?
Answer 0 (score: 1)
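First find each a element and read its href. Then append the href to the element's previous sibling (the text node just before it) and remove the a.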
from bs4 import BeautifulSoup
content='<p>Hello, the web site is <a href="https://www.google.com">Google</a></p>. <p>The search engine is <a href="https://www.baidu.com">Baidu</a></p>.'
soup = BeautifulSoup(content, 'html.parser')
# find all `a` elements
all_a = soup.find_all('a')
for a in all_a:
    # read `href` from `a`
    href = a['href']
    #print('--- before ---')
    #print(soup)
    # append `href` to the previous sibling
    # (assumes a text node comes right before each `a`)
    a.previous_sibling.replace_with(a.previous_sibling + href)
    # remove `a` from the tree
    a.extract()
    #print('--- after ---')
    #print(soup)
print(soup)
'<p>Hello, the web site is https://www.google.com</p>. <p>The search engine is https://www.baidu.com</p>.'
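Note that the output above still contains the <p> tags, while the expected result in the question is plain text. A minimal alternative sketch, not part of the original answer: replace each a directly with its href string via replace_with, then call get_text() to drop the remaining tags. The helper name links_to_urls is made up for illustration, and href=True is used on the assumption that anchors without an href should be left alone.

```python
from bs4 import BeautifulSoup


def links_to_urls(html):
    """Hypothetical helper: swap every <a href=...> for its URL text."""
    soup = BeautifulSoup(html, 'html.parser')
    # href=True skips anchors that have no href attribute
    for a in soup.find_all('a', href=True):
        # replace the whole element with a plain string;
        # this does not depend on a previous sibling existing
        a.replace_with(a['href'])
    return soup


content = '<p>Hello, the web site is <a href="https://www.google.com">Google</a></p>. <p>The search engine is <a href="https://www.baidu.com">Baidu</a></p>.'
soup = links_to_urls(content)
print(soup)             # markup with the <p> tags kept
print(soup.get_text())  # plain text only
```

Because the element itself is replaced, this also works when the a is the first child of its parent, where previous_sibling would be None.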