I have this web page:
text = BeautifulSoup(requests.get('https://www.washingtonpost.com/blogs/on-small-business/post/how-to-breed-big-innovation-inside-a-small-business/2013/03/26/b1a8953e-962a-11e2-9e23-09dce87f75a1_blog.html', timeout=7.00).text)
I have a Beautiful Soup function that extracts every <ul> tag that has no attributes, whose <li> child has no attributes, and that contains no <a> tag:
def pull_ul(tag):
    return tag.name == 'ul' and not tag.attrs and not tag.li.attrs and not tag.a
ul_tags = text.find_all(pull_ul)
print ul_tags
When I run it, I get this error message:
AttributeError: 'NoneType' object has no attribute 'attrs'
So I changed the function to:
def pull_ul(tag):
    return tag.name == 'ul' and not tag.attrs and not tag.a
That produces:
[<ul></ul>, <ul> <li class="report-button" id="flag-spam">Spam</li> <li class="report-button" id="flag-offensive">Offensive</li> <li class="report-button" id="flag-disagree">Disagree</li> <li class="report-button" id="flag-offtopic">Off-Topic</li> </ul>]
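This tells me that the error comes from the empty <ul></ul> tag: it has no <li> child, so tag.li is None.
Is there a way to rewrite the function so that it ignores empty tags like this one and the program runs without the error?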
Answer 0 (score: 1)
What if you just add an extra check that tag.li is truthy?
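def pull_ul(tag):
    # Parentheses replace the backslash continuations, because a comment
    # after a trailing backslash is a syntax error.
    return (tag.name == 'ul' and
            not tag.attrs and
            tag.li and  # < HERE: skip any <ul> that has no <li> child
            not tag.li.attrs and
            not tag.a)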
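To see the guard in action without fetching the real page, here is a minimal sketch run against a small inline HTML string. The sample markup is made up for illustration, and the check is written as `tag.li is not None`, a slightly more explicit variant of the truthiness check above. It assumes Beautiful Soup 4 (`bs4`) with the standard library's `html.parser`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page.
html = (
    '<ul></ul>'
    '<ul><li>plain item</li></ul>'
    '<ul><li class="report-button">Spam</li></ul>'
    '<ul><li><a href="#">link</a></li></ul>'
    '<ul class="nav"><li>menu</li></ul>'
)

soup = BeautifulSoup(html, 'html.parser')

# The first <ul> is empty, so looking up its <li> child gives None.
# Calling .attrs on that None is what raised the AttributeError.
first_li = soup.find('ul').li  # None


def pull_ul(tag):
    return (tag.name == 'ul' and
            not tag.attrs and          # the <ul> itself has no attributes
            tag.li is not None and     # guard: the <ul> has an <li> child
            not tag.li.attrs and       # the first <li> has no attributes
            not tag.a)                 # no <a> anywhere inside the <ul>


ul_tags = soup.find_all(pull_ul)
print(ul_tags)
```

Because `and` short-circuits, `tag.li.attrs` is never evaluated when `tag.li` is None, so the empty list is skipped instead of raising an error. Note that `tag.li` only looks at the first <li>; if every <li> must be attribute-free, something like `all(not li.attrs for li in tag.find_all('li'))` could replace that line.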