I'm trying to parse this page with BeautifulSoup4 in IPython: http://www.chronicle.com/article/Major-Private-Gifts-to-Higher/128264. I wrote these lines:
import urllib.request as ur
import re
page = ur.urlopen('http://www.chronicle.com/article/Major-Private-Gifts-to-Higher/128264').read()
Then I got this error:
HTTPError Traceback (most recent call last)
<ipython-input-27-8d5066f9c76f> in <module>()
----> 1 s = ur.urlopen("http://www.chronicle.com/article/Major-Private-Gifts-to-Higher/128264")
/Users/name/anaconda/lib/python3.6/urllib/request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
221 else:
222 opener = _opener
--> 223 return opener.open(url, data, timeout)
224
225 def install_opener(opener):
/Users/name/anaconda/lib/python3.6/urllib/request.py in open(self, fullurl, data, timeout)
530 for processor in self.process_response.get(protocol, []):
531 meth = getattr(processor, meth_name)
--> 532 response = meth(req, response)
533
534 return response
/Users/name/anaconda/lib/python3.6/urllib/request.py in http_response(self, request, response)
640 if not (200 <= code < 300):
641 response = self.parent.error(
--> 642 'http', request, response, code, msg, hdrs)
643
644 return response
/Users/name/anaconda/lib/python3.6/urllib/request.py in error(self, proto, *args)
568 if http_err:
569 args = (dict, 'default', 'http_error_default') + orig_args
--> 570 return self._call_chain(*args)
571
572 # XXX probably also want an abstract factory that knows when it makes
/Users/name/anaconda/lib/python3.6/urllib/request.py in _call_chain(self, chain, kind, meth_name, *args)
502 for handler in handlers:
503 func = getattr(handler, meth_name)
--> 504 result = func(*args)
505 if result is not None:
506 return result
/Users/name/anaconda/lib/python3.6/urllib/request.py in http_error_default(self, req, fp, code, msg, hdrs)
648 class HTTPDefaultErrorHandler(BaseHandler):
649 def http_error_default(self, req, fp, code, msg, hdrs):
--> 650 raise HTTPError(req.full_url, code, msg, hdrs, fp)
651
652 class HTTPRedirectHandler(BaseHandler):
HTTPError: HTTP Error 403: Forbidden
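How can I fix this? Thanks in advance!
Answer 0 (score: 1)
You probably also need to send the HTTP headers the server expects. Use your browser's developer tools to see which headers Firefox sends to the page, then add those to your request. I'd guess User-Agent is at least one of the headers that has to be set.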
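A minimal sketch of that suggestion using urllib.request from the question. The User-Agent string and the helper names build_request and fetch are assumptions for illustration; copy the real header values from your own browser's developer tools, and note that the site may check more than this one header.

```python
import urllib.request as ur

URL = "http://www.chronicle.com/article/Major-Private-Gifts-to-Higher/128264"

# Assumed value for illustration only; replace it with the string your browser sends.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0",
}


def build_request(url, headers=None):
    """Wrap the URL in a Request object that carries explicit HTTP headers."""
    return ur.Request(url, headers=headers or HEADERS)


def fetch(url):
    """Open the request and return the raw response body as bytes."""
    with ur.urlopen(build_request(url), timeout=10) as response:
        return response.read()
```

Calling fetch(URL) in place of ur.urlopen(URL).read() should then return the page bytes, provided the server is only rejecting the default Python-urllib User-Agent. If it still answers 403, compare the remaining headers your browser sends.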
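Answer 1 (score: 1)
The requests module is easier to use for this.
That said, the problem is what the previous Stack Overflow user described: the server does expect some headers. As far as I know, the requests module has built-in support for that. Note that we don't call the .read() method here; we read the .text attribute instead.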
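import requests
from bs4 import BeautifulSoup as bs
# .text is the response body decoded to a str
html = requests.get('http://www.chronicle.com/article/Major-Private-Gifts-to-Higher/128264').text
# the 'lxml' parser requires the lxml package to be installed
soup = bs(html, 'lxml')
print(soup)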
You don't need to parse it with BeautifulSoup; you can simply do this:
import requests
# .text already holds the page HTML as a str
html = requests.get('http://www.chronicle.com/article/Major-Private-Gifts-to-Higher/128264').text
print(html)
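If the default headers that requests sends turn out not to be enough, the headers can also be passed explicitly. This is a sketch under assumptions: the User-Agent value and the helper name fetch_html are made up for illustration, and the optional session parameter exists only so the function can be exercised without a network connection.

```python
import requests

# Assumed value for illustration only; replace it with the string your browser sends.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0",
}


def fetch_html(url, session=None):
    """Fetch a page with explicit headers and return its decoded HTML."""
    if session is None:
        session = requests.Session()
    response = session.get(url, headers=BROWSER_HEADERS, timeout=10)
    # Raise requests.HTTPError on 4xx/5xx instead of silently returning an error page.
    response.raise_for_status()
    return response.text
```

The raise_for_status() call matters here because requests.get does not raise on a 403 by itself; without it, .text would just contain the body of the error response.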