I'm trying to use Beautiful Soup to extract non-local links (links that are not self-referencing, i.e. not in the same domain as the page I'm scraping). For example, here is code that lets me do the opposite:
import re
from bs4 import BeautifulSoup
from urllib import parse, request

def get_links(root, html):
    soup = BeautifulSoup(html, 'html.parser')
    for link in soup.find_all('a', href=re.compile(r'https://www\.eecs\.mitx\.edu/')):
        href = link.get('href')
        if href:
            text = link.string
            if not text:
                text = ''
            text = re.sub(r'\s+', ' ', text).strip()
            yield (parse.urljoin(root, href), text)

site = 'https://www.eecs.mitx.edu/~professor'  # this is an example
r = request.urlopen(site)
for l in get_links(site, r.read()):
    print(l)
Answer 0 (score: 1)
You can use a custom function to express that requirement. For example, if you want to scrape this page (the current one) and get all links that do not start with https://stackoverflow.com, you can use:
import requests
from bs4 import BeautifulSoup

def get_links(root, html):
    soup = BeautifulSoup(html, 'lxml')
    for link in soup.find_all('a', href=lambda h: h and h.startswith('http') and not h.startswith(root)):
        yield link['href']

r = requests.get('https://stackoverflow.com/questions/49869971/extracting-non-local-links-beautiful-soup')
base = 'https://stackoverflow.com'
for link in get_links(base, r.text):
    print(link)
Partial output:
https://chat.stackoverflow.com
https://meta.stackoverflow.com
https://stackexchange.com/sites
https://stackoverflow.blog
https://meta.stackoverflow.com
https://www.stackoverflowbusiness.com/?ref=topbar_help
https://stackexchange.com/users/?tab=inbox
https://stackexchange.com/users/?tab=reputation
https://stackexchange.com
https://plus.google.com/share?
You can modify the function lambda h: h and h.startswith('http') and not h.startswith(root) as needed.
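One caveat worth noting when modifying that lambda: a raw string-prefix check treats any URL that merely starts with the root string as internal, and it classifies subdomains as external. A minimal sketch of a stricter variant that compares parsed hostnames with the standard library's urllib.parse instead (the HTML snippet and the example.com domain here are made up for illustration):

```python
from urllib.parse import urlparse
from bs4 import BeautifulSoup

html = """
<a href="https://example.com/about">same host</a>
<a href="https://blog.example.com/post">subdomain</a>
<a href="https://other.org/page">external</a>
<a href="/relative">relative link</a>
"""

def is_external(href, root_netloc):
    # Only consider absolute http(s) links, and compare the parsed
    # hostname rather than a raw string prefix, so a URL such as
    # https://example.com.evil.org is not mistaken for an internal one.
    parsed = urlparse(href)
    return parsed.scheme in ('http', 'https') and parsed.netloc != root_netloc

root = urlparse('https://example.com').netloc
soup = BeautifulSoup(html, 'html.parser')
external = [a['href'] for a in soup.find_all('a', href=lambda h: h and is_external(h, root))]
print(external)  # ['https://blog.example.com/post', 'https://other.org/page']
```

Under this rule a subdomain still counts as external; if you want to treat subdomains as internal, you could compare the suffix of netloc instead, depending on your definition of "local".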
Also, in your code you use the re module in this line:

soup.find_all('a', href=re.compile(r'https://www\.eecs\.mitx\.edu/'))

You can do that partial match (the ^ prefix operator) with the help of a CSS selector instead, without using re:

soup.select('a[href^="https://www.eecs.mitx.edu/"]')
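The same prefix match can also be inverted inside the selector itself: the CSS :not() pseudo-class, which BeautifulSoup's select() supports, lets you ask directly for anchors whose href does not start with the site root. A sketch against a made-up HTML snippet (the mitx.edu root is taken from the question; the external URL is invented for illustration):

```python
from bs4 import BeautifulSoup

html = """
<a href="https://www.eecs.mitx.edu/courses">internal</a>
<a href="https://external.example.org/page">external</a>
<a href="/about">relative</a>
"""

soup = BeautifulSoup(html, 'html.parser')
# [href^="http"] keeps only absolute links; :not([href^="..."])
# then drops those that start with the site root.
links = [a['href'] for a in
         soup.select('a[href^="http"]:not([href^="https://www.eecs.mitx.edu/"])')]
print(links)  # ['https://external.example.org/page']
```

Like the lambda version, this is a plain prefix test, so it shares the same subdomain caveat.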