Question

I'm trying to parse this page with BeautifulSoup4 in IPython: http://www.chronicle.com/article/Major-Private-Gifts-to-Higher/128264. I wrote these lines of code:

import urllib.request as ur
import re
page = ur.urlopen('http://www.chronicle.com/article/Major-Private-Gifts-to-Higher/128264').read()

Then I got this error:

HTTPError                                 Traceback (most recent call last)
<ipython-input-27-8d5066f9c76f> in <module>()
----> 1 s = ur.urlopen("http://www.chronicle.com/article/Major-Private-Gifts-to-Higher/128264")

/Users/name/anaconda/lib/python3.6/urllib/request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    221     else:
    222         opener = _opener
--> 223     return opener.open(url, data, timeout)
    224 
    225 def install_opener(opener):

/Users/name/anaconda/lib/python3.6/urllib/request.py in open(self, fullurl, data, timeout)
    530         for processor in self.process_response.get(protocol, []):
    531             meth = getattr(processor, meth_name)
--> 532             response = meth(req, response)
    533 
    534         return response

/Users/name/anaconda/lib/python3.6/urllib/request.py in http_response(self, request, response)
    640         if not (200 <= code < 300):
    641             response = self.parent.error(
--> 642                 'http', request, response, code, msg, hdrs)
    643 
    644         return response

/Users/name/anaconda/lib/python3.6/urllib/request.py in error(self, proto, *args)
    568         if http_err:
    569             args = (dict, 'default', 'http_error_default') + orig_args
--> 570             return self._call_chain(*args)
    571 
    572 # XXX probably also want an abstract factory that knows when it makes

/Users/name/anaconda/lib/python3.6/urllib/request.py in _call_chain(self, chain, kind, meth_name, *args)
    502         for handler in handlers:
    503             func = getattr(handler, meth_name)
--> 504             result = func(*args)
    505             if result is not None:
    506                 return result

/Users/name/anaconda/lib/python3.6/urllib/request.py in http_error_default(self, req, fp, code, msg, hdrs)
    648 class HTTPDefaultErrorHandler(BaseHandler):
    649     def http_error_default(self, req, fp, code, msg, hdrs):
--> 650         raise HTTPError(req.full_url, code, msg, hdrs, fp)
    651 
    652 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 403: Forbidden

How can I fix this? Thanks in advance!

Answer 1

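You probably also need to send the HTTP headers the server expects. Use your browser's developer tools to see which headers Firefox sends when it requests the page, and add those to your request. I'd guess User-Agent is at least one of the headers that has to be set.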
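A minimal sketch of that suggestion with urllib alone. By default urllib identifies itself as Python-urllib, which some servers reject, so this may be what triggers the 403 here. The User-Agent string and the helper names build_request and fetch are illustrative assumptions, not values taken from the site; copy the headers your own browser actually sends.

```python
import urllib.request as ur
from urllib.error import HTTPError

# Assumption: an illustrative browser-like User-Agent string.
# Replace it with whatever your browser's developer tools show.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def build_request(url, user_agent=BROWSER_UA):
    """Return a Request that carries an explicit User-Agent header."""
    return ur.Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Fetch url and return the body as bytes.

    On an HTTP error, print the status the server sent and re-raise.
    """
    try:
        with ur.urlopen(build_request(url)) as response:
            return response.read()
    except HTTPError as err:
        print("Server answered", err.code, err.reason)
        raise
```

Calling fetch('http://www.chronicle.com/article/Major-Private-Gifts-to-Higher/128264') would then replace the bare ur.urlopen(...).read() from the question. If the server still answers 403, it may be checking further headers as well.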

Answer 2

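This is simpler with the requests module.

The underlying problem, though, is what the previous answer describes: the site expects certain headers. As far as I know, requests has built-in support for this, since it sends a default User-Agent header of its own. Note that we read the body with .text rather than .read().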

import requests
from bs4 import BeautifulSoup as bs

# .text is the response body decoded to a str
html = requests.get('http://www.chronicle.com/article/Major-Private-Gifts-to-Higher/128264').text
soup = bs(html, 'lxml')  # the 'lxml' parser needs the lxml package installed

print(soup)

If you don't need to parse it with BeautifulSoup, you can just...

import requests

# Fetch the page and print the raw HTML as text
html = requests.get('http://www.chronicle.com/article/Major-Private-Gifts-to-Higher/128264').text
print(html)
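If the default headers that requests sends turn out not to be enough, the header can be set explicitly and the status checked before the body is used. This is only a sketch: the User-Agent value and the helper names make_session and fetch_text are assumptions for illustration.

```python
import requests

# Assumption: an illustrative browser-like User-Agent string.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def make_session(user_agent=BROWSER_UA):
    """Return a Session that sends user_agent with every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def fetch_text(session, url, timeout=10):
    """Return the decoded body of url, raising on a 4xx/5xx status."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError, e.g. on 403
    return response.text
```

With this, fetch_text(make_session(), url) returns the same string as requests.get(url).text, except that a 403 raises an exception instead of silently handing back the error page.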

Python 3.6.1: 'HTTPError: HTTP Error 403: Forbidden'
