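How can (nested) if/else statements be avoided with BeautifulSoup by using a decorator?

Asked: 2015-08-20 15:24:25

Tags: python beautifulsoup

Problem description

Because BeautifulSoup returns either a soup object or None, a function needs as many if/else statements as there are .find or .find_all follow-up searches.

Question

How can this be avoided by using a decorator (or a similar approach)?

Example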

Assume there are two different html websites (with these example snippets):

# example with wanted class in html file
<td class='translation'>
    <span class='italiano'>ciao</span>
    <span class='french'>au revoir</span>
    <span class='polish'>cześć</span>
</td>

# example without wanted class in another html file
<td class='no_translation'>
    foo
</td>

If you search with the following snippet, everything is fine for the first html snippet, but for the second one you get this AttributeError:

>>> soup.find('td', class_='translation').find('span', class_='polish')
AttributeError: 'NoneType' object has no attribute 'find'

There are two obvious ways to handle the AttributeError:

# using if-else statements for every result of .find or .find_all
def possibility_1():
    translation = soup.find('td', class_='translation')
    if translation is not None:
        polish = translation.find('span', class_='polish')
        return polish
    return None

# use a try-except block for the problem
def possibility_2():
    try:
        translation = soup.find('td', class_='translation')
        polish = translation.find('span', class_='polish')
        return polish
    except AttributeError:
        return None

What about a third solution that uses a decorator function? How could that be done?

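2 Answers:

Answer 0 (score: 3)

Thanks to the comments (almost a discussion) from @jonrsharpe and the answer from @Logan, I will stick with the decorator idea, but with information about which search returned None.

Here is my decorator as one possible solution.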

import sys
import inspect
from functools import wraps
from bs4 import BeautifulSoup

# Decorator that returns None and prints trace info if
# soup.find or soup.find_all fails at a certain point
def robust_soup(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except AttributeError:
            # just an example without formatting
            print(inspect.getinnerframes(sys.exc_info()[2]))
            return None
    return wrapper

Now I can use it like this:

# a good working example
soup_good = BeautifulSoup("""
<td class='translation'>
    <span class='italiano'>ciao</span>
    <span class='french'>au revoir</span>
    <span class='polish'>cześć</span>
</td>""")

# an example which would lead to AttributeError if not handled
soup_bad = BeautifulSoup("""
<td class='no_translation'>
    something uninteresting
</td>""")

@robust_soup
def get_desired_result(soup):
    translation = soup.find('td', class_='translation')
    polish = translation.find('span', class_='polish')
    print(polish)

>>> # with a soup containing information
>>> get_desired_result(soup_good)
<span class="polish">cześć</span>

>>> # with a soup which normally fails
>>> get_desired_result(soup_bad)
# some debugging output from inspect module (also
# with information where last error occurred!)
None
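
For reference, here is a minimal Python 3 sketch of how such a decorator could report only the line where the lookup failed, instead of printing the raw frame records. It uses the standard-library traceback module. The names report_failed_lookup, FakeSoup and get_polish are hypothetical, and FakeSoup is only a stand-in whose find always returns None, so that the sketch runs without BeautifulSoup installed.

```python
import traceback
from functools import wraps


def report_failed_lookup(func):
    """Return None and print the failing line if a lookup chain hits None."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except AttributeError as exc:
            # The last traceback entry is the frame where the error was raised.
            frame = traceback.extract_tb(exc.__traceback__)[-1]
            print(f"lookup failed in {frame.name}, line {frame.lineno}: {frame.line}")
            return None
    return wrapper


class FakeSoup:
    """Hypothetical stand-in for a soup whose search finds nothing."""
    def find(self, *args, **kwargs):
        return None


@report_failed_lookup
def get_polish(soup):
    translation = soup.find('td', class_='translation')
    return translation.find('span', class_='polish')


result = get_polish(FakeSoup())
```

Here result is None, and the printed message names get_polish and the line of the second find call. Note that the decorator hides every AttributeError raised inside the function, not only those caused by a failed search.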

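Answer 1 (score: 2)

Instead of a decorator and manual if/then checks for None, you could consider using your own functions in place of .find and .find_all.

Besides, returning a plain None has 2 more problems.

  • You don't know where the error originated, so debugging will be difficult.
  • Once None has been returned, you may end up calling something like soup.find_all("a") or link["href"] on None. That does not help you at all.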
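
The second point also matters for the try/except approach: attribute access and subscripting on None raise different exception types, so a handler for AttributeError alone does not cover both. A quick check in plain Python (no BeautifulSoup needed; link is just a hypothetical name for the result of a failed search):

```python
link = None  # what soup.find(...) returns when nothing matches

# Attribute access on None raises AttributeError ...
try:
    link.find_all("a")
except AttributeError as exc:
    attribute_error = exc

# ... but subscripting None raises TypeError instead.
try:
    link["href"]
except TypeError as exc:
    type_error = exc

print(type(attribute_error).__name__)
print(type(type_error).__name__)
```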

So you could try something like this:

class PseudoNone(object):
    """
    You can call it.
    You can beat it with a stick.
    It will return PseudoNone!
    And you can trace where the None did come from!!"""
    # maps each PseudoNone instance to the search that produced it
    debug = {}
    def __init__(self, created_at):
        PseudoNone.debug[self] = created_at
    def __getattr__(self, attr):
        return self
    def __call__(self, *args, **kwargs):
        return self
    def __getitem__(self, item):
        return self
    def __bool__(self):
        return False

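This 'None' should not have those problems. In addition, every instance is created with an identifier of what caused the None. All child 'Nones' produced by PseudoNone.__call__ and __getitem__ are actually the same object in memory, and therefore have the same failure cause in PseudoNone.debug[obj]. Great for debugging!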

from bs4 import BeautifulSoup

xml = """
<td class='translation'>
    <span class='italiano'>ciao</span>
    <span class='french'>au revoir</span>
    <span class='polish'>cześć</span>
</td>"""

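def find_all(soup, *args, **kwargs):
    results = soup.find_all(*args, **kwargs)
    if isinstance(results, PseudoNone):
        # searching on a PseudoNone: keep the original failure record
        return results
    if not results:
        # remember which search came back empty
        return PseudoNone((soup, args, kwargs))
    return results

def find(soup, *args, **kwargs):
    "As far as I know, BeautifulSoup.find is internally just BeautifulSoup.find_all(*args)[0]"
    results = find_all(soup, *args, **kwargs)
    # indexing a PseudoNone returns the PseudoNone itself
    return results[0]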

soup = BeautifulSoup(xml)

translation = find(soup, 'td', class_='translation')

erroneous_translation = find(soup, 'td', class_='BADTRANSLATIONS')

...

print(translation)
    <td class="translation">
    <span class="italiano">ciao</span>
    <span class="french">au revoir</span>
    <span class="polish">cześć</span>
    </td>

print(erroneous_translation)
    <__main__.PseudoNone object at 0x7fd4bcc18790>

print(erroneous_translation("foo"))
    <__main__.PseudoNone object at 0x7fd4bcc18790>

print(erroneous_translation["baz"])
    <__main__.PseudoNone object at 0x7fd4bcc18790>

print(find_all(erroneous_translation, "something"))
    <__main__.PseudoNone object at 0x7fd4bcc18790>

Oh, it's a fake None! That's not what I wanted. Where did I go wrong!?

print(PseudoNone.debug[erroneous_translation])
    (<html><body><td class="translation">
    <span class="italiano">ciao</span>
    <span class="french">au revoir</span>
    <span class="polish">cześć</span>
    </td></body></html>, ('td',), {'class_': 'BADTRANSLATIONS'})

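Notes:

  • Use isinstance(qux, PseudoNone) rather than == None. (We can't subclass NoneType.)
  • If PseudoNone.debug takes up too much memory, consider hashing *args and **kwargs in the values of PseudoNone.debug (and/or using @functools.lru_cache in python3).
  • This is probably a hack.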
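
To illustrate the first note, here is a short Python 3 sketch of why the check has to be isinstance. It repeats a trimmed-down PseudoNone (without the debug registry) so that it runs on its own; lookup and chained are hypothetical variable names.

```python
class PseudoNone(object):
    """Trimmed-down copy of the class above, without the debug registry."""
    def __getattr__(self, attr):
        return self
    def __call__(self, *args, **kwargs):
        return self
    def __getitem__(self, item):
        return self
    def __bool__(self):
        return False


lookup = PseudoNone()

# Chained attribute access, calls and subscripts all give back the same object.
chained = lookup.find('span')['href'].text
print(chained is lookup)                 # True

# Comparisons against None do not detect it ...
print(lookup is None, lookup == None)    # False False

# ... but isinstance and truthiness do.
print(isinstance(lookup, PseudoNone))    # True
print(bool(lookup))                      # False
```

The truthiness check relies on __bool__, which Python 3 honours; Python 2 looks for __nonzero__ instead, so isinstance is the check that works on both.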