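Extracting non-local links - Beautiful Soup

Time: 2018-04-17 04:31:03

Tags: python web-scraping beautifulsoup

I am trying to extract non-local links with Beautiful Soup, that is, links that are not self-referencing and do not belong to the same domain as the page I am scraping. For example, the following code lets me do the opposite: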

import re
from bs4 import BeautifulSoup
from urllib import parse, request

def get_links(root, html):
    soup = BeautifulSoup(html, 'html.parser')
    # Keep only anchors whose href matches the local domain.
    for link in soup.find_all('a', href=re.compile(r'https://www\.eecs\.mitx\.edu/')):
        href = link.get('href')
        if href:
            text = link.string
            if not text:
                text = ''
            # Collapse runs of whitespace in the link text.
            text = re.sub(r'\s+', ' ', text).strip()
            yield (parse.urljoin(root, href), text)

site = 'https://www.eecs.mitx.edu/~professor'  # this is an example
r = request.urlopen(site)
for l in get_links(site, r.read()):
    print(l)

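1 Answer:

Answer 0 (score: 1)

You can pass a custom function as the href filter to do this. For example, to scrape this page (the current one) and get all links that do not start with https://stackoverflow.com, you can use: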

import requests
from bs4 import BeautifulSoup

def get_links(root, html):
    soup = BeautifulSoup(html, 'lxml')
    for link in soup.find_all('a', href=lambda h: h and h.startswith('http') and not h.startswith(root)):
        yield link['href']

r = requests.get('https://stackoverflow.com/questions/49869971/extracting-non-local-links-beautiful-soup')
base = 'https://stackoverflow.com'
for link in get_links(base, r.text):
    print(link)

Partial output:

https://chat.stackoverflow.com
https://meta.stackoverflow.com
https://stackexchange.com/sites
https://stackoverflow.blog
https://meta.stackoverflow.com
https://www.stackoverflowbusiness.com/?ref=topbar_help
https://stackexchange.com/users/?tab=inbox
https://stackexchange.com/users/?tab=reputation
https://stackexchange.com
https://plus.google.com/share?

You can adjust the filter function lambda h: h and h.startswith('http') and not h.startswith(root) to suit your needs.
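
As one possible variation on that filter, here is a minimal sketch that compares host names instead of string prefixes. The helper name is_external is hypothetical and not part of Beautiful Soup; it uses only urllib.parse from the standard library. It assumes that "local" means "same host as root", so subdomains such as chat.stackoverflow.com still count as non-local, as in the output above.

```python
from urllib.parse import urljoin, urlparse

def is_external(href, root):
    """Return True if href, resolved against root, points at a different host.

    Hypothetical helper: "local" is assumed to mean "same host as root".
    """
    if not href:
        return False
    # Resolve relative and protocol-relative links against the root first.
    absolute = urlparse(urljoin(root, href))
    # Skip mailto:, javascript: and other non-web schemes.
    if absolute.scheme not in ('http', 'https'):
        return False
    return absolute.hostname != urlparse(root).hostname

root = 'https://stackoverflow.com'
hrefs = ['/questions/1', 'https://meta.stackoverflow.com', 'mailto:someone@example.com']
external = [h for h in hrefs if is_external(h, root)]
print(external)
```

It can be plugged into the answer's code as soup.find_all('a', href=lambda h: is_external(h, root)). Unlike a plain startswith check, it does not treat a host such as stackoverflow.com.example.org as local, and it also catches protocol-relative links that begin with //.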

Also, in your code you use the re module for this line:

soup.find_all('a', href=re.compile('https://www\.eecs\.mitx\.edu/'))

You can do this partial match (^) with a CSS selector instead, without using re. For example:

soup.select('a[href^="https://www.eecs.mitx.edu/"]')
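
The same selector idea can be turned around for the non-local case the question asks about. The following is a minimal sketch, assuming Beautiful Soup 4.7 or later, where select() supports the :not() pseudo-class. The function name external_links and the sample HTML are made up for illustration, and root is assumed not to contain double quotes.

```python
from bs4 import BeautifulSoup

def external_links(html, root):
    """Return hrefs of absolute http(s) links that do not start with root.

    Hypothetical helper; assumes root contains no double quotes.
    """
    soup = BeautifulSoup(html, 'html.parser')
    # ^= is a prefix match on the attribute; :not() excludes the local prefix.
    selector = 'a[href^="http"]:not([href^="{}"])'.format(root)
    return [a['href'] for a in soup.select(selector)]

# Made-up sample page for illustration.
sample_html = '''
<a href="https://www.eecs.mitx.edu/~professor/notes">Notes</a>
<a href="/relative/page">Relative</a>
<a href="https://example.org/paper">Paper</a>
<a href="http://example.com/">Example</a>
<a>No href</a>
'''

links = external_links(sample_html, 'https://www.eecs.mitx.edu/')
print(links)
```

Like the lambda in the answer, this skips relative links entirely, since they never start with http.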