Question

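I'm scraping a web page that doesn't use any useful classes or IDs in its HTML tags, so I have to scrape all the links and look for patterns in them. Here is what a sample of the HTML looks like: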

<span>Category</span><a href='example.com/link-about-a'>A</a>

On another page, we might have a different category:

<span>Category</span><a href='example.com/link-about-b'>B</a>

Using beautifulsoup4, my current solution looks like this:

def category(soup):
    for x in soup.find_all('a'):
        if 'link-about-a' in x['href']:
            return 'A'
        if 'link-about-b' in x['href']:
            return 'B'

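And so on... but this is very ugly.

I'm wondering if there is a way to make this more concise,

like using a dictionary

categories = {'A': 'link-about-a', 'B': 'link-about-b'}

and reducing it to a single expression.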

Answer 1

All you need is another loop:

def category(soup):
    for x in soup.find_all('a'):
        # .items() works on Python 2 and 3; .iteritems() is Python 2 only
        for k, v in categories.items():
            if v in x['href']:
                return k

If you want a single expression, though:

category = next((
    k for x in soup.find_all('a')
      for k, v in categories.items()
      if v in x['href']
), None)
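
One caveat: `x['href']` raises `KeyError` for an `<a>` tag that has no `href` attribute. Here is a minimal sketch that avoids this, assuming Python 3 and the `categories` dictionary from the question. The helper name `category_from_hrefs` is hypothetical and only there to keep the matching logic separate from the parsing.

```python
categories = {'A': 'link-about-a', 'B': 'link-about-b'}

def category_from_hrefs(hrefs, categories):
    """Return the first category whose pattern occurs in any href, else None."""
    return next(
        (name
         for href in hrefs
         for name, pattern in categories.items()
         if pattern in href),
        None,
    )

def category(soup):
    # href=True makes find_all skip <a> tags that have no href attribute,
    # so the x['href'] lookup cannot raise KeyError.
    hrefs = (x['href'] for x in soup.find_all('a', href=True))
    return category_from_hrefs(hrefs, categories)
```

With bs4 installed, this would be called as `category(BeautifulSoup(html, 'html.parser'))`. It returns `None` when no link matches any pattern.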

Answer 2

Using regular expressions and a list of categories may be a bit more flexible:

import re

categories = [[re.compile('link-about-a'), 'A'],
              [re.compile('link-about-b'), 'B']]

def category(soup):
    # find_all is the bs4 name; findAll is the older alias
    for x in soup.find_all('a'):
        for expression, description in categories:
            if expression.search(x['href']):
                return description
    # no link matched any pattern
    return None
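
The regex version can also be collapsed into a single expression with `next()`. This is a sketch assuming Python 3. The helper name `category_from_hrefs` and the `re.IGNORECASE` flag are illustrative choices, not part of the answer above. The flag is only there to show one thing a plain substring test cannot do.

```python
import re

# (compiled pattern, category) pairs; for a given link, the first matching pair wins
categories = [
    (re.compile(r'link-about-a', re.IGNORECASE), 'A'),
    (re.compile(r'link-about-b', re.IGNORECASE), 'B'),
]

def category_from_hrefs(hrefs):
    """Return the category of the first href matching any pattern, else None."""
    return next(
        (description
         for href in hrefs
         for expression, description in categories
         if expression.search(href)),
        None,
    )

def category(soup):
    # href=True skips <a> tags that have no href attribute
    return category_from_hrefs(x['href'] for x in soup.find_all('a', href=True))
```

Passing the hrefs in as a plain iterable means the matching can be checked without parsing any HTML, for example `category_from_hrefs(['example.com/LINK-ABOUT-B'])`.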

Making multiple if statements less verbose
