What value does BeautifulSoup return when it fails to parse HTML?

Time: 2017-09-03 10:01:00

Tags: python-3.x beautifulsoup web-crawler

I am trying to crawl a web page and extract URLs down to a maximum level of 3. My code is as follows:

import lxml.html
import urllib.request
from bs4 import BeautifulSoup

stopLevel = 3
rootUrls = ['http://ps.ucdavis.edu/']

foundUrls = {}
for rootUrl in rootUrls:
    foundUrls.update({rootUrl : {'Level':0, 'Parent':'N/A'}})

def getProtocolAndDomainName(url):
    # split the url on '://' into [protocol, rest]
    protocolAndOther = url.split('://')
    protocol = protocolAndOther[0]
    # the domain is everything in 'rest' before the first '/'
    domainName = protocolAndOther[1].split('/')[0]
    # returns e.g. 'https://example.com' (protocol plus domain only)
    return protocol + '://' + domainName


def crawl(urls, stopLevel = 5, level=1):
    nextUrls = []
    if (level <= stopLevel):
        for url in urls:
            # need to handle urls (e.g., https) that cannot be read
            try:
                openedUrl = urllib.request.urlopen(url).read()
                soup = BeautifulSoup(openedUrl, 'html.parser')
            except:
                print('cannot read for :' + url)

            for a in soup.find_all('a', href=True):
                href = a['href']
                if href is not None:
                    # for the case of a link is relative path
                    if '://' not in href:
                        href = getProtocolAndDomainName(url) + href
                    # check url has been already visited or not
                    if href not in foundUrls:
                        foundUrls.update({href: {'Level': level,
                                                 'Parent': url}})
                        nextUrls.append(href)
        # recursive call
        crawl(nextUrls, stopLevel, level + 1)

crawl(rootUrls, stopLevel)
print(foundUrls)

After running the code, it fails with the error message UnboundLocalError: local variable 'soup' referenced before assignment. I understand this happens because BeautifulSoup fails to parse openedUrl, so the local variable soup is never defined, which in turn breaks the loop. My first fix was to make soup global by adding global soup at the top of def crawl(urls, stopLevel = 5, level=1):. However, I was told that this does not really solve the problem. My second fix was to use if ... continue to keep the loop running when BeautifulSoup fails to parse, but whether I test if soup == ' ' or if soup == None, it still does not work. I would like to know what value BeautifulSoup returns when it fails. Can anyone help? Or is there another solution? Many thanks.
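The failure mechanism is easy to reproduce in isolation (a minimal sketch; the unreachable host is deliberate, purely to force urlopen to raise):

import urllib.request
from bs4 import BeautifulSoup

def fetch():
    try:
        # urlopen raises here, so the next line never runs
        data = urllib.request.urlopen('https://nonexistent.invalid/').read()
        soup = BeautifulSoup(data, 'html.parser')  # never executed
    except Exception as e:
        print('cannot read: ' + str(e))  # exception swallowed, function keeps going
    return soup.title  # UnboundLocalError: 'soup' referenced before assignment

fetch()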

2 Answers:

Answer 0 (score: 1)

Usually, when BeautifulSoup cannot parse a document, it still returns a bs4 object but emits a warning. If you pass it something that is not a string or buffer, it raises a TypeError.
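A quick way to see this (assuming bs4 is installed; the input is an arbitrary example):

from bs4 import BeautifulSoup

# Even empty markup yields a BeautifulSoup object, never None,
# so checks like soup == None or soup == ' ' cannot detect a failure.
soup = BeautifulSoup('', 'html.parser')
print(type(soup))      # <class 'bs4.BeautifulSoup'>
print(soup is None)    # False -- you always get a soup object back
print(soup.find('a'))  # None -- find() returns None when nothing matches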

In this case, the exception is most likely raised by urllib rather than BeautifulSoup, but you catch it and let your script keep running without actually handling it.
That leads to the error on the next line: the assignment to soup inside the try block never ran, so soup is undefined (your UnboundLocalError is a subclass of NameError).

As a quick fix, you can use continue so that the loop moves on to the next item:

try:
    openedUrl = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(openedUrl, 'html.parser')
except urllib.error.HTTPError as e:
    print('HTTP Error ' + str(e.code) + ' for: ' + url)
    continue
except KeyboardInterrupt: 
    print('Script terminated by user.')
    return
except Exception as e:
    print(e) 
    continue
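Folded into the crawl function from the question, the loop would look roughly like this (a sketch using the question's imports and helper functions, not a tested drop-in replacement):

def crawl(urls, stopLevel=5, level=1):
    nextUrls = []
    if level <= stopLevel:
        for url in urls:
            try:
                openedUrl = urllib.request.urlopen(url).read()
                soup = BeautifulSoup(openedUrl, 'html.parser')
            except Exception as e:
                print('cannot read for :' + url + ' (' + str(e) + ')')
                continue  # skip this url; soup was never assigned
            for a in soup.find_all('a', href=True):
                href = a['href']
                # prepend protocol and domain for relative links
                if '://' not in href:
                    href = getProtocolAndDomainName(url) + href
                if href not in foundUrls:
                    foundUrls.update({href: {'Level': level, 'Parent': url}})
                    nextUrls.append(href)
        crawl(nextUrls, stopLevel, level + 1)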

Answer 1 (score: 0)

After opening the request, you should call read() to get the HTML returned by the URL:

openedUrl = urllib.request.urlopen(url).read()

Update: the site is blocking urllib's default user agent. To work around this, you should mask it with Firefox's user agent:

try:
    user_agent = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0'}
    req = urllib.request.Request(url=url, headers=user_agent)
    openedUrl = urllib.request.urlopen(req)
    soup = BeautifulSoup(openedUrl, 'html.parser')
except:
    print('cannot read for :' + url)
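Note that BeautifulSoup also accepts the response object directly (it calls read() on anything file-like), so an explicit read() is optional here. A self-contained sketch of the workaround, using the URL and User-Agent string from this thread:

import urllib.request
import urllib.error
from bs4 import BeautifulSoup

user_agent = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0'}
req = urllib.request.Request(url='http://ps.ucdavis.edu/', headers=user_agent)
try:
    with urllib.request.urlopen(req) as response:
        soup = BeautifulSoup(response.read(), 'html.parser')
    print(str(len(soup.find_all('a', href=True))) + ' links found')
except urllib.error.URLError as e:
    print('cannot read for : http://ps.ucdavis.edu/ (' + str(e) + ')')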