Replace <a> </a> with its href in BeautifulSoup

Asked: 2016-01-22 03:06:53

Tags: python html beautifulsoup

content='<p>Hello, the web site is <a href="https://www.google.com">Google</a></p>. <p>The search engine is <a href="https://www.baidu.com">Baidu</a></p>.'
soup = BeautifulSoup(content, 'html.parser')

Now I want to replace each whole <a> </a> element with the URL in its href. The result I expect is:

Hello, the web site is https://www.google.com. The search engine is https://www.baidu.com.

Can anyone suggest a solution?

1 answer:

Answer 0 (score: 1)

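First find each `a` tag and read its `href`; then append the `href` to the text node that precedes the tag (its previous sibling) and remove the `a`.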

from bs4 import BeautifulSoup

content='<p>Hello, the web site is <a href="https://www.google.com">Google</a></p>. <p>The search engine is <a href="https://www.baidu.com">Baidu</a></p>.'
soup = BeautifulSoup(content, 'html.parser')

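# find every `a` tag that has an `href` attribute
all_a = soup.find_all('a', href=True)

for a in all_a:
    # read the URL from the `href` attribute
    href = a['href']

    #print('--- before ---')
    #print(soup)

    prev = a.previous_sibling
    if isinstance(prev, str):
        # append `href` to the text node just before `a`, then remove `a`
        prev.replace_with(prev + href)
        a.extract()
    else:
        # no preceding text node (e.g. `a` is the first child): swap `a` for the URL
        a.replace_with(href)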

    #print('--- after ---')
    #print(soup)

print(soup)

'<p>Hello, the web site is https://www.google.com</p>. <p>The search engine is https://www.baidu.com</p>.'
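
A shorter variant of the same idea, offered as a sketch rather than as the answerer's method: `replace_with` can swap each `a` tag for its `href` string directly, so the previous sibling never needs to be touched. The helper name `replace_links_with_href` is made up for this illustration. Since the question's expected result has no `<p>` tags, the sketch also calls `get_text()` to drop the remaining markup.

```python
from bs4 import BeautifulSoup


def replace_links_with_href(html):
    """Hypothetical helper: return a soup where every <a href=...> is replaced by its URL."""
    soup = BeautifulSoup(html, 'html.parser')
    # href=True skips anchors that have no href attribute
    for a in soup.find_all('a', href=True):
        # replace the whole tag (including its link text) with the URL string
        a.replace_with(a['href'])
    return soup


content = '<p>Hello, the web site is <a href="https://www.google.com">Google</a></p>. <p>The search engine is <a href="https://www.baidu.com">Baidu</a></p>.'
soup = replace_links_with_href(content)

# markup with the links replaced, <p> tags still present
print(soup)
# plain text only, which is the form the question asks for
print(soup.get_text())
```

Anchors without an `href` are left untouched here; whether they should be removed or kept is a choice the original question does not cover.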