Question

I have this web page:

text = BeautifulSoup(requests.get('https://www.washingtonpost.com/blogs/on-small-business/post/how-to-breed-big-innovation-inside-a-small-business/2013/03/26/b1a8953e-962a-11e2-9e23-09dce87f75a1_blog.html', timeout=7.00).text)

I have a Beautiful Soup function that pulls every <ul> tag that has no attributes, whose <li> child has no attributes, and that contains no <a> tag:

def pull_ul(tag):
        return tag.name == 'ul' and not tag.attrs and not tag.li.attrs and not tag.a  
ul_tags = text.find_all(pull_ul)
print ul_tags

When I run it, I get this error message:

AttributeError: 'NoneType' object has no attribute 'attrs'

So I changed the function to:

def pull_ul(tag):
        return tag.name == 'ul' and not tag.attrs and not tag.a

That produces:

[<ul></ul>, <ul> <li class="report-button" id="flag-spam">Spam</li> <li class="report-button" id="flag-offensive">Offensive</li> <li class="report-button" id="flag-disagree">Disagree</li> <li class="report-button" id="flag-offtopic">Off-Topic</li> </ul>]

This tells me that the empty tag <ul></ul> is what triggers the error.

Is there a way to rewrite the function so that it ignores every empty tag and the program can run?

Answer 1

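What if you just add an extra check that tag.li is truthy? For a <ul> with no <li> inside it, tag.li is None, which is why tag.li.attrs raises the AttributeError.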

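def pull_ul(tag):
    # Parentheses replace the backslash continuations: a comment after
    # a trailing backslash is a syntax error.
    return (tag.name == 'ul' and
            not tag.attrs and
            tag.li and  # < HERE
            not tag.li.attrs and
            not tag.a)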
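
To see how the guard behaves, here is a minimal sketch that runs the filter on a small hand-written sample instead of the live page. SAMPLE_HTML is a hypothetical stand-in, not the Washington Post markup, and the explicit 'html.parser' argument and the bool(...) wrapper are additions made here for a self-contained example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page.
SAMPLE_HTML = """
<ul></ul>
<ul><li>plain item</li></ul>
<ul><li class="report-button">Spam</li></ul>
<ul><li><a href="#">link</a></li></ul>
<ul class="nav"><li>menu</li></ul>
"""


def pull_ul(tag):
    # tag.li is None when the <ul> has no <li> descendant, so it must be
    # tested before tag.li.attrs is touched.
    return bool(
        tag.name == 'ul'
        and not tag.attrs
        and tag.li
        and not tag.li.attrs
        and not tag.a
    )


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
ul_tags = soup.find_all(pull_ul)
print(ul_tags)
```

With this sample only the second list should survive: the empty <ul></ul> is skipped without raising, and the others are rejected for an attribute on the <li>, an <a> descendant, or an attribute on the <ul> itself. Note that tag.li looks only at the first <li> it finds, so a later <li> with attributes would not be caught by this check.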

How to ignore empty tags with Beautiful Soup?
