I'm trying to use Beautiful Soup to extract non-local links (links that are not self-referencing, i.e. not in the same domain as the page I'm scraping). For example, here is code that lets me do the opposite:
import re
from bs4 import BeautifulSoup
from urllib import parse, request

def get_links(root, html):
    soup = BeautifulSoup(html, 'html.parser')
    for link in soup.find_all('a', href=re.compile(r'https://www\.eecs\.mitx\.edu/')):
        href = link.get('href')
        if href:
            text = link.string
            if not text:
                text = ''
            text = re.sub(r'\s+', ' ', text).strip()
            yield (parse.urljoin(root, href), text)

site = 'https://www.eecs.mitx.edu/~professor'  # this is an example
r = request.urlopen(site)
for l in get_links(site, r.read()):
    print(l)
Answer 0 (score: 1)
You can use a custom function to express that requirement. For example, if you want to scrape this page (the current one) and get all links that do not start with https://stackoverflow.com, you can use:
import requests
from bs4 import BeautifulSoup

def get_links(root, html):
    soup = BeautifulSoup(html, 'lxml')
    for link in soup.find_all('a', href=lambda h: h and h.startswith('http') and not h.startswith(root)):
        yield link['href']

r = requests.get('https://stackoverflow.com/questions/49869971/extracting-non-local-links-beautiful-soup')
base = 'https://stackoverflow.com'
for link in get_links(base, r.text):
    print(link)
Partial output:
https://chat.stackoverflow.com
https://meta.stackoverflow.com
https://stackexchange.com/sites
https://stackoverflow.blog
https://meta.stackoverflow.com
https://www.stackoverflowbusiness.com/?ref=topbar_help
https://stackexchange.com/users/?tab=inbox
https://stackexchange.com/users/?tab=reputation
https://stackexchange.com
https://plus.google.com/share?
You can modify the function lambda h: h and h.startswith('http') and not h.startswith(root) as needed.
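One caveat worth noting when modifying that lambda: a raw string-prefix check treats any URL that merely starts with the root string as internal, and it classifies subdomains as external. A minimal sketch of a stricter variant that compares parsed hostnames with the standard library's urllib.parse instead (the HTML snippet and the example.com domain here are made up for illustration):

```python
from urllib.parse import urlparse
from bs4 import BeautifulSoup

html = """
<a href="https://example.com/about">same host</a>
<a href="https://blog.example.com/post">subdomain</a>
<a href="https://other.org/page">external</a>
<a href="/relative">relative link</a>
"""

def is_external(href, root_netloc):
    # Only consider absolute http(s) links, and compare the parsed
    # hostname rather than a raw string prefix, so a URL such as
    # https://example.com.evil.org is not mistaken for an internal one.
    parsed = urlparse(href)
    return parsed.scheme in ('http', 'https') and parsed.netloc != root_netloc

root = urlparse('https://example.com').netloc
soup = BeautifulSoup(html, 'html.parser')
external = [a['href'] for a in soup.find_all('a', href=lambda h: h and is_external(h, root))]
print(external)  # ['https://blog.example.com/post', 'https://other.org/page']
```

Under this rule a subdomain still counts as external; if you want to treat subdomains as internal, you could compare the suffix of netloc instead, depending on your definition of "local".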
Also, in your code you use the re module in this line:

soup.find_all('a', href=re.compile(r'https://www\.eecs\.mitx\.edu/'))

You can do that partial match (the ^ prefix operator) with the help of a CSS selector instead, without using re:

soup.select('a[href^="https://www.eecs.mitx.edu/"]')
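The same prefix match can also be inverted inside the selector itself: the CSS :not() pseudo-class, which BeautifulSoup's select() supports, lets you ask directly for anchors whose href does not start with the site root. A sketch against a made-up HTML snippet (the mitx.edu root is taken from the question; the external URL is invented for illustration):

```python
from bs4 import BeautifulSoup

html = """
<a href="https://www.eecs.mitx.edu/courses">internal</a>
<a href="https://external.example.org/page">external</a>
<a href="/about">relative</a>
"""

soup = BeautifulSoup(html, 'html.parser')
# [href^="http"] keeps only absolute links; :not([href^="..."])
# then drops those that start with the site root.
links = [a['href'] for a in
         soup.select('a[href^="http"]:not([href^="https://www.eecs.mitx.edu/"])')]
print(links)  # ['https://external.example.org/page']
```

Like the lambda version, this is a plain prefix test, so it shares the same subdomain caveat.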