I'm using Python 2.7 with BeautifulSoup4 to try to find all unique links in an HTML body. So if the HTML body contains 3 links that are exactly the same, only one of them should be used. My code looks like this:
def extract_links_from_content(self):
    content = "%s %s" % (self.html_body, self.plaintext)
    url_regex = 'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    links = []
    soup = BeautifulSoup(content, "html5lib")
    for link in soup.findAll('a'):
        if not link.get('no_track'):
            target = link.get('href')
            name = link.get('data-name')
            link_text = unicode(link)
            # If there is no target, or it is a mailto or template link,
            # skip it, because we can't track what isn't there or an email link
            if not target or target.startswith('mailto') or '{' in target:
                continue
            # If the target is a bookmark, skip it
            if target.startswith('#'):
                continue
            target = re.search(url_regex, target)
            if target:
                links.append({
                    'name': name,
                    'target': target.group()
                })
    return links
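For example (the HTML below is just an illustration I made up), a body like this currently yields three identical entries where I only want one:

html_body = '''
<a href="http://example.com/page" data-name="promo">first</a>
<a href="http://example.com/page" data-name="promo">second</a>
<a href="http://example.com/page" data-name="promo">third</a>
'''
# Desired result: [{'name': 'promo', 'target': 'http://example.com/page'}]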
Answer 0 (score: 1)
You can actually combine all of the checks into a single function and pass that function to find_all() as the filter. It looks cleaner and more readable:
import re
from bs4 import BeautifulSoup

def filter_links(link):
    href = link and link.get('href')
    if not (link and link.name == 'a' and href):
        return False
    return all([
        not link.has_attr('no_track'),
        not href.startswith('mailto'),
        '{' not in href,
        not href.startswith('#')
    ])
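When you pass a callable to find_all(), BeautifulSoup invokes it once per tag in the tree and keeps the tags for which it returns True, so all of the filtering logic lives in one place.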
content = "%s %s" % (self.html_body, self.plaintext)
pattern = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
soup = BeautifulSoup(content, "html5lib")

links = []
for link in soup.find_all(filter_links):
    target = link['href']
    name = link.get('data-name')
    target = pattern.search(target)
    if target:
        links.append({
            'name': name,
            'target': target.group()
        })
To avoid duplicate links, you can make a set out of the links list. For that to work, you need to store tuples instead of dictionaries:
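Here is a minimal sketch of that idea (my own illustration, reusing the filter_links function and pattern defined above):

links = set()
for link in soup.find_all(filter_links):
    target = pattern.search(link['href'])
    if target:
        # Tuples are hashable, so the set silently drops exact duplicates
        links.add((link.get('data-name'), target.group()))

# Convert back to the original list-of-dicts shape if needed
links = [{'name': name, 'target': target} for name, target in links]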