Question

I have this tag:

<div class="post_header">\n<h3><a href="http://chesterwest.net/design/ranch-style-house-plans/" title="Ranch Style House Plans">Ranch Style House Plans\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t</a>\n</h3>\n</div>

Is there a simple way to get this from it:

<div class= >\n<h3><a href= title= </a>\n</h3>\n</div>

I have tried everything I could think of and considered regular expressions, but is there another way?

Answer 1

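Use findAll(True) (spelled find_all(True) in current BeautifulSoup) to match every tag, then process each tag it finds. More information here.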

Example:

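from bs4 import BeautifulSoup


def remove_attributes(soup):
    """Blank every attribute value and the text of single-string tags."""
    # find_all(True) (formerly findAll) matches every tag in the document.
    for tag in soup.find_all(True):
        # Keep each attribute name but replace its value with an empty string.
        tag.attrs = {key: "" for key in tag.attrs}
        # tag.string is None when a tag has more than one child,
        # so in this example only the <a> text is emptied.
        if tag.string is not None:
            tag.string = ""
    # Collapse runs of whitespace (newlines, tabs) into single spaces.
    return " ".join(str(soup).split())


example = (
    '<div class="post_header">\n<h3><a '
    'href="http://chesterwest.net/design/ranch-style-house-plans/" '
    'title="Ranch Style House Plans">Ranch Style House '
    'Plans\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t</a>\n</h3>\n</div>'
)

soup = BeautifulSoup(example, "html.parser")
print(remove_attributes(soup))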

Output:

<div class=""> <h3><a href="" title=""></a> </h3> </div>
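If only the tag and attribute names are needed, rather than a re-serialized document, the same find_all(True) traversal can collect them directly. This is a minimal sketch that assumes "template" means the tag name plus the attribute names before each '=' sign; tag_templates is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup


def tag_templates(html):
    """Return (tag name, [attribute names]) for every tag, in document order.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    # find_all(True) matches every tag; tag.attrs maps attribute names to values,
    # so list(tag.attrs) yields just the names.
    return [(tag.name, list(tag.attrs)) for tag in soup.find_all(True)]


html = (
    '<div class="post_header">\n<h3><a href="http://chesterwest.net/design/'
    'ranch-style-house-plans/" title="Ranch Style House Plans">'
    "Ranch Style House Plans</a>\n</h3>\n</div>"
)

for name, attribute_names in tag_templates(html):
    print(name, attribute_names)
```

Returning plain tuples leaves the final formatting (for example, joining the names with '=') to the caller and avoids modifying the parsed tree.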

Is there a way in BeautifulSoup to get only the template of a tag, the part before the '=' sign?
