Since BeautifulSoup returns either a soup object or None, a function needs as many if/else statements as there are .find or .find_all follow-up searches.
How can this be avoided by using a decorator (or a similar approach)?
Assume there are two different HTML pages (using these example snippets):
# example with wanted class in html file
<td class='translation'>
<span class='italiano'>ciao</span>
<span class='french'>au revoir</span>
<span class='polish'>cześć</span>
</td>
# example without wanted class in another html file
<td class='no_translation'>
foo
</td>
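If you run a search like the following, everything is fine for the first HTML snippet, but for the second one you get this:
>>> soup.find('td', class_='translation').find('span', class_='polish')
AttributeError: 'NoneType' object has no attribute 'find'
There are two obvious ways to handle the AttributeError, shown below. What about a third solution that uses a decorator function? How would that work?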
# using if-else statements for every result of .find or .find_all
def possibility_1():
    translation = soup.find('td', class_='translation')
    if translation is not None:
        polish = translation.find('span', class_='polish')
        return polish
    return None

# use a try-except block for the problem
def possibility_2():
    try:
        translation = soup.find('td', class_='translation')
        polish = translation.find('span', class_='polish')
        return polish
    except AttributeError:
        return None
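The None check in possibility_1 can be generalized to any number of chained lookups. The following is only a minimal sketch: `find_chain` and the `Node` stand-in are hypothetical names introduced here, and `Node` exists only so the example runs without bs4 installed. With a real soup, each step would carry the arguments of one `.find` call.

```python
def find_chain(node, *steps):
    """Apply successive .find(*args, **kwargs) calls, stopping at the first None."""
    for args, kwargs in steps:
        if node is None:
            return None
        node = node.find(*args, **kwargs)
    return node


class Node:
    """Tiny stand-in for a parsed tag (assumption: only .find(name) is needed)."""

    def __init__(self, **children):
        self.children = children

    def find(self, name):
        # Mirrors the relevant behaviour of .find: a match or None.
        return self.children.get(name)


tree = Node(td=Node(span=Node()))

# Every step succeeds, so the innermost node is returned.
found = find_chain(tree, (("td",), {}), (("span",), {}))

# The first step yields None, so the second step is never attempted.
missing = find_chain(tree, (("table",), {}), (("span",), {}))
```

With BeautifulSoup the call would look like `find_chain(soup, (('td',), {'class_': 'translation'}), (('span',), {'class_': 'polish'}))`, which returns None for the second HTML snippet instead of raising.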
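Answer 0 (score: 3)
Thanks to the comments (almost a discussion) by @jonrsharpe and the answer by @Logan, I will stick with the decorator idea, but with information about which search returned None.
Here is my decorator as one possible solution.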
import sys
import inspect
from functools import wraps
from bs4 import BeautifulSoup

# Decorator that returns None and prints trace info if
# soup.find or soup.find_all fails at a certain point
def robust_soup(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except AttributeError:
            # just an example without formatting
            print(inspect.getinnerframes(sys.exc_info()[2]))
            return None
    return wrapper
Now I can use it:
# a good working example
soup_good = BeautifulSoup("""
<td class='translation'>
<span class='italiano'>ciao</span>
<span class='french'>au revoir</span>
<span class='polish'>cześć</span>
</td>""", "html.parser")

# an example which would lead to AttributeError if not handled
soup_bad = BeautifulSoup("""
<td class='no_translation'>
something uninteresting
</td>""", "html.parser")

@robust_soup
def get_desired_result(soup):
    translation = soup.find('td', class_='translation')
    polish = translation.find('span', class_='polish')
    print(polish)
>>> # with a soup containing the information
>>> get_desired_result(soup_good)
<span class="polish">cześć</span>
>>> # with a soup which would normally fail
>>> get_desired_result(soup_bad)
# some debugging output from the inspect module (also
# with information about where the last error occurred);
# the call itself returns None
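The decorator above prints the raw frame records "without formatting". One possible way to format them, offered as a sketch rather than the answer's own method, is to report only the last traceback entry, which is the line where the lookup hit None. The names `robust_lookup` and `chained_lookup` are made up for this example, and the demo function works on plain Python objects so that it runs without bs4.

```python
import sys
import traceback
from functools import wraps


def robust_lookup(func):
    """Return None instead of raising AttributeError, and report where it failed."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except AttributeError as exc:
            # The last traceback entry is the innermost frame, i.e. the
            # line inside func where an attribute was looked up on None.
            frame = traceback.extract_tb(exc.__traceback__)[-1]
            print(f"{func.__name__}: {exc} (line {frame.lineno})",
                  file=sys.stderr)
            return None
    return wrapper


@robust_lookup
def chained_lookup(node):
    # Stand-in for soup.find(...).find(...): raises when node is None.
    return node.find("x")


ok = chained_lookup("axb")    # str.find works, so the result is passed through
failed = chained_lookup(None) # AttributeError is caught and reported
```

The message goes to stderr so that it does not mix with the function's regular output. Note that this still hides every AttributeError raised inside the decorated function, not only those caused by a failed search.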
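Answer 1 (score: 2)
Instead of a decorator, or checking for None manually with if/then, you might consider using your own functions in place of .find and .find_all.
Also, returning a plain None has two more problems: it does not tell you where it came from, and after receiving one you may still end up calling soup.find_all("a") or link["href"] on None. That won't help you at all. So you could try something like this: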
class PseudoNone(object):
    """
    You can call it.
    You can beat it with a stick.
    It will return PseudoNone!
    And you can trace where the None did come from!!"""

    # maps each PseudoNone instance to the search that produced it
    debug = {}

    def __init__(self, created_at):
        PseudoNone.debug[self] = created_at

    def __getattr__(self, attr):
        return self

    def __call__(self, *args, **kwargs):
        return self

    def __getitem__(self, item):
        return self

    def __bool__(self):
        return False
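This "None" should not have those problems. In addition, each instance is created with an identifier of what caused the None. All child "Nones" produced by PseudoNone.__call__ or __getitem__ are actually the same object in memory, and therefore have the same failure cause in PseudoNone.debug[obj]. Great for debugging!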
from bs4 import BeautifulSoup

xml = """
<td class='translation'>
<span class='italiano'>ciao</span>
<span class='french'>au revoir</span>
<span class='polish'>cześć</span>
</td>"""

def find_all(soup, *args, **kwargs):
    # searching inside a PseudoNone keeps the original failure cause
    if isinstance(soup, PseudoNone):
        return soup
    results = soup.find_all(*args, **kwargs)
    if not results:
        return PseudoNone((soup, args, kwargs))
    else:
        return results

def find(soup, *args, **kwargs):
    "As far as I know, BeautifulSoup.find is internally just BeautifulSoup.find_all(*args)[0]"
    results = find_all(soup, *args, **kwargs)
    return results[0]

soup = BeautifulSoup(xml)
translation = find(soup, 'td', class_='translation')
erroneous_translation = find(soup, 'td', class_='BADTRANSLATIONS')
...
>>> print(translation)
<td class="translation">
<span class="italiano">ciao</span>
<span class="french">au revoir</span>
<span class="polish">cześć</span>
</td>
>>> print(erroneous_translation)
<__main__.PseudoNone object at 0x7fd4bcc18790>
>>> print(erroneous_translation("foo"))
<__main__.PseudoNone object at 0x7fd4bcc18790>
>>> print(erroneous_translation["baz"])
<__main__.PseudoNone object at 0x7fd4bcc18790>
>>> print(find_all(erroneous_translation, "something"))
<__main__.PseudoNone object at 0x7fd4bcc18790>
Oh, it's a fake! That's not what I wanted. Where did I go wrong!!?
>>> print(PseudoNone.debug[erroneous_translation])
(<html><body><td class="translation">
<span class="italiano">ciao</span>
<span class="french">au revoir</span>
<span class="polish">cześć</span>
</td></body></html>, ('td',), {'class_': 'BADTRANSLATIONS'})
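Notes:
To check whether a search failed, use isinstance(qux, PseudoNone) rather than == None. (We cannot subclass NoneType.)
If PseudoNone.debug takes up too much memory, consider hashing *args and **kwargs in the values of PseudoNone.debug (and/or using @functools.lru_cache in Python 3).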
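Another way to limit the memory used by the class-level debug dict, offered here as an assumption and not as part of the answer, is to store the origin on the instance itself, so that it is released together with the object. `TracedNone` and `created_at` are hypothetical names for this sketch, which also shows the isinstance check from the note above.

```python
class TracedNone:
    """Hypothetical variant of PseudoNone that keeps its origin on the instance."""

    def __init__(self, created_at):
        # A normal instance attribute: __getattr__ below is only
        # consulted for attributes that are *not* found the usual way.
        self.created_at = created_at

    def __getattr__(self, attr):
        return self

    def __call__(self, *args, **kwargs):
        return self

    def __getitem__(self, item):
        return self

    def __bool__(self):
        return False


missing = TracedNone((("td",), {"class_": "BADTRANSLATIONS"}))

# Attribute access, calls and indexing all hand back the same object.
result = missing.find("span")["href"]

# Check for failure with isinstance, not with == None.
if isinstance(result, TracedNone):
    print("lookup failed at:", result.created_at)
```

The trade-off is that the name `created_at` is now reserved: accessing it returns the stored origin instead of the object itself.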