In BeautifulSoup, is there a way to get only the template of a tag, i.e. the part before the '=' sign?

Asked: 2017-03-21 13:24:50

Tags: python html parsing beautifulsoup

I have this markup:

<div class="post_header">\n<h3><a href="http://chesterwest.net/design/ranch-style-house-plans/" title="Ranch Style House Plans">Ranch Style House Plans\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t</a>\n</h3>\n</div>

Is there a simple way to get this from it:

<div class= >\n<h3><a href= title= </a>\n</h3>\n</div>

I have tried everything I can think of and considered regular expressions, but is there another way?

1 answer:

Answer 0 (score: 1)

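Use findAll(True) (spelled find_all(True) in current bs4) to match every tag in the document, then blank out each tag's attribute values and text. The Beautiful Soup documentation covers passing True as a filter.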
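To make concrete what passing True as the filter matches, here is a minimal sketch. The sample markup and the variable names are made up for illustration and are not part of the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the question.
html = '<div class="box"><h3><a href="/x" title="X">link text</a></h3></div>'
soup = BeautifulSoup(html, "html.parser")

# Passing True as the filter matches every tag, but no text nodes.
tags = soup.find_all(True)

# Tag names in document order, and the attribute names each tag carries.
names = [tag.name for tag in tags]
attr_names = {tag.name: sorted(tag.attrs) for tag in tags}

print(names)
print(attr_names)
```

Each matched tag exposes its attributes as a dict in tag.attrs, which is what makes it possible to keep the attribute names while discarding the values.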

Example:

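# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

def RemoveAttributes(soup):
    # True matches every tag in the document (but no text nodes).
    for tag in soup.find_all(True):
        # Keep each attribute name but blank out its value.
        # (Python 3: iterate the dict directly; iteritems() is Python 2 only.)
        tag.attrs = {key: "" for key in tag.attrs}
        # Clear the text of leaf tags only, so that a tag wrapping a
        # single child tag does not lose that child.
        if tag.string is not None and tag.find(True) is None:
            tag.string = ""
    # Collapse newlines and runs of whitespace into single spaces.
    return " ".join(str(soup).split())

example = """<div class="post_header">\n<h3><a href="http://chesterwest.net/design/ranch-style-house-plans/" title="Ranch Style House Plans">Ranch Style House Plans\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t</a>\n</h3>\n</div>"""

soup = BeautifulSoup(example, 'html.parser')
print(RemoveAttributes(soup))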

Output:

<div class=""> <h3><a href="" title=""></a> </h3> </div>
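The output above keeps empty quotes after each '='. If the goal is closer to the form in the question, with bare attribute names and nothing after the '=', one possible alternative is sketched below. It does not use BeautifulSoup at all: it swaps in the standard library's html.parser.HTMLParser, and the class name TagSkeleton, the function name tag_skeleton, and the exact output spacing are assumptions chosen for illustration. Unlike the form in the question, it also drops the whitespace between tags.

```python
from html.parser import HTMLParser


class TagSkeleton(HTMLParser):
    """Collect tags with attribute names only, dropping values and text."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; keep only the names.
        if attrs:
            names = "".join(" {}=".format(name) for name, _ in attrs)
            self.parts.append("<{}{} >".format(tag, names))
        else:
            self.parts.append("<{}>".format(tag))

    def handle_endtag(self, tag):
        self.parts.append("</{}>".format(tag))

    # handle_data is not overridden, so all text is discarded.


def tag_skeleton(html):
    parser = TagSkeleton()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


example = (
    '<div class="post_header">\n<h3>'
    '<a href="http://chesterwest.net/design/ranch-style-house-plans/" '
    'title="Ranch Style House Plans">Ranch Style House Plans</a>\n</h3>\n</div>'
)
skeleton = tag_skeleton(example)
print(skeleton)
# <div class= ><h3><a href= title= ></a></h3></div>
```

Because this builds the string from parser events, the result is not valid HTML and is only useful as a structural fingerprint of the markup.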