Python BeautifulSoup: finding unique links

Date: 2015-03-06 18:09:15

Tags: python beautifulsoup

I'm using BeautifulSoup4 with Python 2.7 to try to find all the unique links in an HTML body.

So if the HTML body contains three identical links, only one of them should be used.

My code looks like this:

def extract_links_from_content(self):
    content = "%s %s" % (self.html_body, self.plaintext)
    url_regex = 'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    links = []

    soup = BeautifulSoup(content, "html5lib")

    for link in soup.findAll('a'):
        if not link.get('no_track'):
            target = link.get('href')
            name = link.get('data-name')
            link_text = unicode(link)

            # If there is no target, or it's a mailto link,
            # skip it, because we can't track what isn't there or an email link
            if not target or target.startswith('mailto') or '{' in target:
                continue

            # If the target is a bookmark, skip it
            if target.startswith('#'):
                continue

            target = re.search(url_regex, target)

            if target:
                links.append({
                    'name': name,
                    'target': target.group()
                })

    return links
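The question itself (dropping duplicates) can be handled after extraction. A minimal sketch, assuming the `{'name': ..., 'target': ...}` dict shape built above; `unique_links` is a hypothetical helper name, and it hashes a tuple because dicts themselves are not hashable:

```python
def unique_links(links):
    """Drop duplicate link dicts, keeping the first occurrence in order."""
    seen = set()
    unique = []
    for link in links:
        key = (link['name'], link['target'])  # tuples are hashable, dicts are not
        if key not in seen:
            seen.add(key)
            unique.append(link)
    return unique
```

Calling `unique_links(self.extract_links_from_content())` would then return each distinct (name, target) pair once, preserving the original order.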

1 answer:

Answer 0 (score: 1)

You can actually combine all of the checks into a single function and pass that function to find_all() as the matching criterion. It looks cleaner and more readable:

import re

from bs4 import BeautifulSoup

def filter_links(link):
    href = link and link.get('href')

    return all([
        link,
        link.name == 'a',
        href,
        not link.has_attr('no_track'),
        not href.startswith('mailto'),
        '{' not in href,
        not href.startswith('#')
    ])

content = "%s %s" % (self.html_body, self.plaintext)
pattern = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')

soup = BeautifulSoup(content, "html5lib")

links = []
for link in soup.find_all(filter_links):
    target = link['href']
    name = link.get('data-name')

    target = pattern.search(target)
    if target:
        links.append({
            'name': name,
            'target': target.group()
        })

To avoid duplicate links, you can build a set from the links list. For that, you need to store (name, target) tuples instead of dictionaries, since dictionaries are not hashable.
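A minimal sketch of that tuple-and-set approach; the sample data and the example.com URLs are hypothetical stand-ins for the (name, target) pairs the loop above produces:

```python
# Hypothetical extracted pairs: the first two are duplicates.
raw = [
    ('promo', 'http://example.com/a'),
    ('promo', 'http://example.com/a'),
    ('docs', 'http://example.com/b'),
]

links = set()
for name, target in raw:
    links.add((name, target))  # a set silently ignores repeated tuples

# Convert back to the dict shape if callers expect it:
result = [{'name': n, 'target': t} for n, t in sorted(links)]
```

Note that a set does not preserve insertion order, which is why the comprehension sorts the pairs before rebuilding the dicts.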