How can I ignore empty tags with Beautiful Soup?

Time: 2016-02-28 04:29:34

Tags: python beautifulsoup

I have this web page:

text = BeautifulSoup(requests.get('https://www.washingtonpost.com/blogs/on-small-business/post/how-to-breed-big-innovation-inside-a-small-business/2013/03/26/b1a8953e-962a-11e2-9e23-09dce87f75a1_blog.html', timeout=7.00).text)

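I have a Beautiful Soup filter function that should extract every <ul> tag that has no attributes, whose <li> child has no attributes, and that contains no <a> tag: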

def pull_ul(tag):
        return tag.name == 'ul' and not tag.attrs and not tag.li.attrs and not tag.a  
ul_tags = text.find_all(pull_ul)
print ul_tags

When I run it, I get this error message:

AttributeError: 'NoneType' object has no attribute 'attrs'

So I changed the function to:

def pull_ul(tag):
        return tag.name == 'ul' and not tag.attrs and not tag.a 

That outputs:

[<ul></ul>, <ul> <li class="report-button" id="flag-spam">Spam</li> <li class="report-button" id="flag-offensive">Offensive</li> <li class="report-button" id="flag-disagree">Disagree</li> <li class="report-button" id="flag-offtopic">Off-Topic</li> </ul>]

This tells me that the error comes from the empty tag <ul></ul>: it has no <li> child, so tag.li is None.

Is there a way to rewrite the function so that it skips every empty tag that would otherwise raise this error?

1 Answer:

Answer 0: (score: 1)

What if you just add an extra check that tag.li is truthy?

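def pull_ul(tag):
    # Parentheses replace the backslash continuations: a comment placed
    # after a trailing backslash is a syntax error.
    return (tag.name == 'ul' and
            not tag.attrs and
            tag.li and  # < HERE: stops before tag.li.attrs when there is no <li>
            not tag.li.attrs and
            not tag.a)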
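
To see the guard at work without fetching the page, here is a minimal sketch run against a small inline document. The SAMPLE_HTML markup is made up for illustration (it is not the Washington Post page from the question), and the "html.parser" backend is an assumption; any installed parser should behave the same for this input. The check is written as `tag.li is not None`, a slightly more explicit form of the truthiness test above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this sketch.
SAMPLE_HTML = (
    '<ul></ul>'
    '<ul><li>plain item</li></ul>'
    '<ul><li class="report-button">Spam</li></ul>'
    '<ul><li><a href="#">link</a></li></ul>'
    '<ul class="nav"><li>menu</li></ul>'
)


def pull_ul(tag):
    """Match attribute-free <ul> tags whose first <li> has no attributes
    and that contain no <a> tag."""
    return (
        tag.name == 'ul'
        and not tag.attrs
        and tag.li is not None  # guard: an empty <ul> has no <li>
        and not tag.li.attrs    # only reached when tag.li exists
        and tag.a is None
    )


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
ul_tags = soup.find_all(pull_ul)
print(ul_tags)  # [<ul><li>plain item</li></ul>]
```

Because `and` short-circuits, `tag.li.attrs` is never evaluated for the empty <ul>, so no AttributeError is raised. Note that `tag.li` returns only the first <li> found inside the tag, so later <li> elements with attributes are not checked by this filter.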