Question

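I'm currently developing a web application (using Flask for the backend).

On the backend, I use Selenium to retrieve the page source of a given URL. I want to go through page_source and disable every link whose href is not in a list. Something like this: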

body = browser.page_source
soup = BeautifulSoup(body, 'html.parser')
for link in soup.a:
    if not (link['href'] in link_list):
        link['href']=""

I'm new to Beautiful Soup, so I'm not sure of the syntax. I'm using Beautiful Soup 4.

Answer 1

Figured it out:

from bs4 import BeautifulSoup

soup = BeautifulSoup(c_body, 'lxml')  # you can also use 'html.parser'
for a in soup.find_all('a'):
    if a['href'] not in src_lst:  # src_lst is a list of the URLs you want to keep
        del a['href']
        a.name = 'span'  # to avoid the style associated with links
soup.span.unwrap()      # remove the span tag and keep its text (affects only the first <span>)
c_body = str(soup)      # c_body will be displayed in an iframe using srcdoc

Edit: the code above can break if there is no span tag, so this is a better approach:

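soup = BeautifulSoup(c_body, 'lxml')
for a in soup.find_all('a'):
    if a.has_attr('href'):  # skip anchors that have no href at all
        if a['href'] not in src_lst:
            del a['href']
            a.name = 'span'

if soup.span is not None:  # soup.span is None when the document has no <span>
    soup.span.unwrap()     # unwraps only the first <span>
c_body = str(soup)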
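
Note that soup.span.unwrap() unwraps only the first <span> in the document, so the snippets above leave any other disabled links as <span> elements. If the goal is to replace every disallowed link with its plain text, one option is to call unwrap() on each <a> directly. The following is only a minimal sketch under that assumption: the function name disable_links, the allowed_hrefs parameter, and the sample URLs are hypothetical and not from the original answer, and it uses 'html.parser' so that lxml is not required.

```python
from bs4 import BeautifulSoup


def disable_links(html, allowed_hrefs):
    """Return html with each <a> whose href is not allowed replaced by its contents."""
    allowed = set(allowed_hrefs)  # set membership is faster than scanning a list
    soup = BeautifulSoup(html, "html.parser")
    # href=True matches only anchors that actually carry an href attribute.
    # find_all returns a list, so modifying the tree inside the loop is safe.
    for a in soup.find_all("a", href=True):
        if a["href"] not in allowed:
            a.unwrap()  # drop the <a> tag itself, keep its text and children
    return str(soup)


# Hypothetical sample input for illustration
sample = (
    '<p><a href="https://a.example/">keep</a> '
    '<a href="https://b.example/">drop</a></p>'
)
result = disable_links(sample, ["https://a.example/"])
print(result)
```

Anchors without an href are left untouched here, and exact string matching means that relative and absolute forms of the same URL are treated as different links.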

How to disable all links that are not in a list using Beautiful Soup
