I am reading the book Web Scraping with Python, which provides the following function to retrieve the external links found on a page:
# Retrieves a list of all external links found on a page
def getExternalLinks(bs, excludeUrl):
    externalLinks = []
    # Finds all links that start with "http" that do
    # not contain the current URL
    for link in bs.find_all('a', {'href': re.compile('^(http|www)((?!' + excludeUrl + ').)*$')}):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in externalLinks:
                externalLinks.append(link.attrs['href'])
    return externalLinks
The problem is that it does not work correctly. When I run it with the URL http://www.oreilly.com, it returns the following:
bs = makeSoup('https://www.oreilly.com') # Makes a BeautifulSoup Object
getExternalLinks(bs, 'https://www.oreilly.com')
['https://www.oreilly.com',
'https://oreilly.com/sign-in.html',
'https://oreilly.com/online-learning/try-now.html',
'https://oreilly.com/online-learning/index.html',
'https://oreilly.com/online-learning/individuals.html',
'https://oreilly.com/online-learning/teams.html',
'https://oreilly.com/online-learning/enterprise.html',
'https://oreilly.com/online-learning/government.html',
'https://oreilly.com/online-learning/academic.html',
'https://oreilly.com/online-learning/pricing.html',
'https://www.oreilly.com/partner/reseller-program.html',
'https://oreilly.com/conferences/',
'https://oreilly.com/ideas/',
'https://oreilly.com/about/approach.html',
'https://www.oreilly.com/conferences/',
'https://conferences.oreilly.com/velocity/vl-ny',
'https://conferences.oreilly.com/artificial-intelligence/ai-eu',
'https://www.safaribooksonline.com/public/free-trial/',
'https://www.safaribooksonline.com/team-setup/',
'https://www.oreilly.com/online-learning/enterprise.html',
'https://www.oreilly.com/about/approach.html',
'https://conferences.oreilly.com/software-architecture/sa-eu',
'https://conferences.oreilly.com/velocity/vl-eu',
'https://conferences.oreilly.com/software-architecture/sa-ny',
'https://conferences.oreilly.com/strata/strata-ca',
'http://shop.oreilly.com/category/customer-service.do',
'https://twitter.com/oreillymedia',
'https://www.facebook.com/OReilly/',
'https://www.linkedin.com/company/oreilly-media',
'https://www.youtube.com/user/OreillyMedia',
'https://www.oreilly.com/emails/newsletters/',
'https://itunes.apple.com/us/app/safari-to-go/id881697395',
'https://play.google.com/store/apps/details?id=com.safariflow.queue']
Why are the first 16-17 entries considered "external links"? They belong to the same domain as http://www.oreilly.com.
Answer 0 (score: 1)
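There is a difference between these two:
http://www.oreilly.com
https://www.oreilly.com
The scheme differs (http vs. https), so an exclude string built from one will not match links that use the other.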
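To make this concrete: as far as I can tell, the scheme is only part of the story. The pattern consumes the leading "http" before the negative lookahead is first tested, so an exclude string that itself starts with "http" can never be found at any later position in the link. The sketch below rebuilds the pattern from the question and checks a few URLs taken from the output; `book_pattern` and `looks_external` are hypothetical helper names introduced here for illustration, not part of the book's code.

```python
import re

def book_pattern(exclude_url):
    # Same pattern as in the question: "http" or "www" is consumed first,
    # then the negative lookahead is tested at every following position.
    return re.compile('^(http|www)((?!' + exclude_url + ').)*$')

def looks_external(href, exclude_url):
    # True when the question's pattern would report href as external
    return book_pattern(exclude_url).match(href) is not None

urls = [
    'https://www.oreilly.com/conferences/',
    'https://oreilly.com/ideas/',
    'https://conferences.oreilly.com/velocity/vl-ny',
    'https://twitter.com/oreillymedia',
]

# Compare a full URL as the exclude string with a bare host name
for exclude in ('https://www.oreilly.com', 'oreilly.com'):
    print('exclude =', exclude)
    for url in urls:
        print('   ', url, '->', looks_external(url, exclude))
```

With the full URL as the exclude string, every link above is reported as external, including the ones on www.oreilly.com. Passing only the host ('oreilly.com') lets the lookahead find it, although this is still plain substring matching, and the unescaped "." in the exclude string matches any character.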
Answer 1 (score: 0)
import re
from urllib.parse import urlsplit
from urllib.request import urlopen

from bs4 import BeautifulSoup

ext = set()  # external links collected so far

def getExt(url):
    # Split the starting URL so its host (netloc) can be compared, e.g. 'oreilly.com'
    o = urlsplit(url)
    html = urlopen(url)
    bs = BeautifulSoup(html, 'html.parser')
    # Only consider absolute http:// or https:// links
    for link in bs.find_all('a', href=re.compile('^https?://')):
        # Skip any link whose URL contains the starting host
        if o.netloc in link.attrs['href']:
            continue
        ext.add(link.attrs['href'])

getExt('https://oreilly.com/')
for i in ext:
    print(i)
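The substring check above treats any URL that merely contains the host text as internal. A stricter variant compares parsed host names instead. The following is a minimal sketch, not the book's method: `normalize_host` and `is_external` are hypothetical names, and counting subdomains and the "www." form as the same site is an assumption you may want to change.

```python
from urllib.parse import urlsplit

def normalize_host(url):
    # Lowercase host name without a leading 'www.' ('' if the URL has no host)
    host = urlsplit(url).hostname or ''
    return host[4:] if host.startswith('www.') else host

def is_external(href, base_url):
    # Only absolute http(s) links can point to another site
    if urlsplit(href).scheme not in ('http', 'https'):
        return False
    base = normalize_host(base_url)
    host = normalize_host(href)
    # Assumption: the base host and all of its subdomains count as internal
    return not (host == base or host.endswith('.' + base))

for href in ('https://www.oreilly.com/conferences/',
             'https://conferences.oreilly.com/velocity/vl-ny',
             'https://twitter.com/oreillymedia'):
    print(href, '->', is_external(href, 'http://www.oreilly.com'))
```

Because only the host is compared, the http/https difference no longer matters, and a link such as https://twitter.com/oreillymedia stays external even though its path contains "oreilly".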