Question

I am trying to download the HTML of a page (in this case http://www.guangxindai.com), but I get a 403 error. Here is my code:

import urllib.request
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
f = opener.open("http://www.guangxindai.com")
f.read()

But I get an error response:

Traceback (most recent call last):
  File "<pyshell#7>", line 1, in <module>
    f = opener.open("http://www.guangxindai.com")
  File "C:\Python33\lib\urllib\request.py", line 475, in open
    response = meth(req, response)
  File "C:\Python33\lib\urllib\request.py", line 587, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python33\lib\urllib\request.py", line 513, in error
    return self._call_chain(*args)
  File "C:\Python33\lib\urllib\request.py", line 447, in _call_chain
    result = func(*args)
  File "C:\Python33\lib\urllib\request.py", line 595, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

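I have tried different request headers, but I still cannot get a correct response. I can view the page in a browser, which seems strange to me. My guess is that the site uses some method to block web spiders. Does anyone know what is going on, and how I can fetch the page's HTML correctly?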

Answer 1

I ran into the same problem as you and found the answer at the linked question.

The answer given by Stefano Sanfilippo is quite simple and worked for me:

from urllib.request import Request, urlopen

# headers must be a dict ("name": "value"); braces with a comma would build a set
url_request = Request("http://www.guangxindai.com",
                      headers={"User-Agent": "Mozilla/5.0"})
webpage = urlopen(url_request).read()
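
If the request still fails, it may help to catch the error and look at the status code instead of letting the traceback end the session. Below is a minimal sketch of that idea, not a definitive fix: the helper names build_request and fetch_html and the 10-second timeout are my own assumptions for illustration, and whether a browser-like User-Agent is enough depends entirely on how the site filters clients.

```python
import urllib.error
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    """Return a Request whose headers are passed as a dict."""
    # Hypothetical helper: only sets a User-Agent, nothing else.
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Return the page body as bytes, or None if the server sends an HTTP error."""
    request = build_request(url)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # err.code holds the status (for example 403) and err.reason the message.
        print("HTTP error:", err.code, err.reason)
        return None
```

Calling fetch_html("http://www.guangxindai.com") would then give you either the raw bytes of the page or None plus a printed status, which makes it easier to compare the effect of different headers.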

Answer 2

If your goal is just to read the page's HTML, you can use the following code. It works for me on Python 2.7:

# Python 2.7 only: urllib.urlopen does not exist in Python 3
import urllib
f = urllib.urlopen("http://www.guangxindai.com")
f.read()

Python urllib.request.urlopen() returns error 403
