I am trying to crawl web pages and extract the URLs up to a maximum level of 3. My code is as follows:
import lxml.html
import urllib.request
from bs4 import BeautifulSoup

stopLevel = 3
rootUrls = ['http://ps.ucdavis.edu/']

foundUrls = {}
for rootUrl in rootUrls:
    foundUrls.update({rootUrl: {'Level': 0, 'Parent': 'N/A'}})

def getProtocolAndDomainName(url):
    # split the url by '://', which returns a list
    protocolAndOther = url.split('://')
    protocol = protocolAndOther[0]
    domainName = protocolAndOther[1].split('/')[0]
    # this only returns 'https://xxxxx.com'
    return protocol + '://' + domainName

def crawl(urls, stopLevel=5, level=1):
    nextUrls = []
    if level <= stopLevel:
        for url in urls:
            # need to handle urls (e.g., https) that cannot be read
            try:
                openedUrl = urllib.request.urlopen(url).read()
                soup = BeautifulSoup(openedUrl, 'html.parser')
            except:
                print('cannot read for :' + url)
            for a in soup.find_all('a', href=True):
                href = a['href']
                if href is not None:
                    # for the case where the link is a relative path
                    if '://' not in href:
                        href = getProtocolAndDomainName(url) + href
                    # check whether the url has already been visited
                    if href not in foundUrls:
                        foundUrls.update({href: {'Level': level, 'Parent': url}})
                        nextUrls.append(href)
        # recursive call
        crawl(nextUrls, stopLevel, level + 1)

crawl(rootUrls, stopLevel)
print(foundUrls)
After running the code, it shows the error message UnboundLocalError: local variable 'soup' referenced before assignment. I understand this happens because BeautifulSoup could not parse openedUrl, so the local variable soup never gets defined, which in turn makes the loop fail. My first idea was to make soup global by putting global soup under def crawl(urls, stopLevel=5, level=1):, but I was told that this does not fix the problem at all. My second idea was to use if ... continue to keep the loop running when BeautifulSoup fails to parse, but whether I check if soup == '' or if soup == None, it still does not work. I would like to know what value BeautifulSoup returns when it fails. Can anyone help? Or is there another solution? Many thanks.
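My second attempt looked roughly like this (a sketch, the exact code may have differed slightly):

try:
    openedUrl = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(openedUrl, 'html.parser')
except:
    print('cannot read for :' + url)
if soup == None:    # also tried soup == ''
    continue
for a in soup.find_all('a', href=True):
    ...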
Answer 0 (score: 1)
Usually, when BeautifulSoup cannot parse a document it still returns a bs4 object, but it prints a warning. If you give it something that is not a string or a buffer, it raises a TypeError.

In this case the exception is most likely raised by urllib rather than BeautifulSoup, but you catch it and let your script carry on without actually handling it. That leads to the UnboundLocalError (a subclass of NameError) on the next line, because the assignment to soup inside the try block failed, so soup is never defined.

As a quick fix, you can use continue so that your loop moves on to the next item:
try:
    openedUrl = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(openedUrl, 'html.parser')
except urllib.error.HTTPError as e:
    print('HTTP Error ' + str(e.code) + ' for: ' + url)
    continue
except KeyboardInterrupt:
    print('Script terminated by user.')
    return
except Exception as e:
    print(e)
    continue
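Catching urllib.error.HTTPError separately lets you log the status code for pages that refuse the request, handling KeyboardInterrupt explicitly means Ctrl-C still stops the crawl instead of being swallowed, and the final except Exception keeps any other unexpected error from killing the whole run while continue skips just the offending URL. Adding an explicit import urllib.error alongside import urllib.request is the safest way to make urllib.error.HTTPError available.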
Answer 1 (score: 0)
You should call read() on the opened request to get the HTML returned by the URL:
openedUrl = urllib.request.urlopen(url).read()
Update: the site is blocking urllib's default user agent; to work around this you should spoof it with Firefox's user agent:
try:
    user_agent = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0'}
    req = urllib.request.Request(url=url, headers=user_agent)
    openedUrl = urllib.request.urlopen(req)
    soup = BeautifulSoup(openedUrl, 'html.parser')
except:
    print('cannot read for :' + url)
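For completeness, a sketch of how this could slot into the crawl loop from the question, combined with the continue suggestion from the other answer (same variable names as above are assumed):

for url in urls:
    try:
        # spoof a browser user agent so the site does not block the request
        user_agent = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0'}
        req = urllib.request.Request(url=url, headers=user_agent)
        openedUrl = urllib.request.urlopen(req).read()
        soup = BeautifulSoup(openedUrl, 'html.parser')
    except Exception as e:
        print('cannot read for :' + url + ' (' + str(e) + ')')
        continue  # skip this url instead of falling through with soup undefined
    for a in soup.find_all('a', href=True):
        href = a['href']
        ...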