Question

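I am trying to open a URL with the code below so that I can parse its contents. The same URL works in a web browser, but when I request it from Python I get a 403 error. How can I get around this?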

import urllib2
URL = 'http://www.google.com/search?q=something%20unusual'
response = urllib2.urlopen(URL)

Response from the Python interpreter: HTTPError: HTTP Error 403: Forbidden

Answer 1

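Google uses User-Agent filtering to keep bots from interacting with its search service. You can observe this by comparing the following results from curl(1), where the second request uses the -A flag to change the User-Agent string: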

$ curl -I 'http://www.google.com/search?q=something%20unusual'
HTTP/1.1 403 Forbidden
...

$ curl -I 'http://www.google.com/search?q=something%20unusual' -A 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0'
HTTP/1.1 200 OK

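You should use the Google Custom Search service to automate Google searches. Alternatively, you can set your own User-Agent header with the urllib2 library (instead of the default "Python-urllib/2.6"), but this may violate Google's terms of service.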
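To see where a default such as "Python-urllib/2.6" comes from, the sketch below inspects the headers a freshly built opener carries. It is written for Python 3, where `urllib.request` took over from `urllib2`; that port is an assumption on my part, since the question uses Python 2. No network request is made.

```python
import urllib.request

# Python 3 sketch: urllib.request is the successor of Python 2's urllib2.
opener = urllib.request.build_opener()

# A fresh opener carries a default header list that holds the library's
# own User-Agent string, which is what a server-side filter sees.
default_headers = dict(opener.addheaders)
default_user_agent = default_headers["User-agent"]

# Prints something like "Python-urllib/3.x", depending on the interpreter.
print(default_user_agent)
```

The version suffix varies with the interpreter, but the "Python-urllib/" prefix is what identifies the client as a script rather than a browser.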

Answer 2

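The User-Agent header is what is causing the problem. It looks to me like the page checks the User-Agent header and forbids any request that does not come from a browser. The key is to set a User-Agent in Python that mimics a browser.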

This worked for me:

In [1]: import urllib2

In [2]: URL = 'http://www.google.com/search?q=something%20unusual'

In [4]: opener = urllib2.build_opener()

In [5]: opener.addheaders = [('User-agent', 'Mozilla/5.0')]

In [6]: response = opener.open(URL)

In [7]: response
Out[7]: <addinfourl at 47799472 whose fp = <socket._fileobject object at 0x02D7F5B0>>

In [8]: response.read()

Hope this helps!
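For readers on Python 3, where `urllib2` no longer exists, a rough equivalent might look like the sketch below. The names `EXAMPLE_URL`, `BROWSER_UA`, `build_request` and `fetch` are hypothetical, and the URL is a placeholder rather than the search URL from the question. Whether a given site accepts the header, or permits automated access at all, is up to that site.

```python
import urllib.error
import urllib.request

EXAMPLE_URL = "http://example.com/"  # hypothetical placeholder URL
BROWSER_UA = "Mozilla/5.0"           # same string as in the session above


def build_request(url, user_agent=BROWSER_UA):
    """Build a Request that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=BROWSER_UA):
    """Return (status_code, body_bytes); HTTP errors are returned, not raised."""
    request = build_request(url, user_agent)
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        # A 403 lands here; the error object also exposes the response body.
        return err.code, err.read()


# Building the request does not touch the network, so the header can be
# checked before anything is sent. Request stores the key as "User-agent".
request = build_request(EXAMPLE_URL)
print(request.get_header("User-agent"))
```

Calling `fetch(EXAMPLE_URL)` would perform the actual request. Returning the status code instead of letting `HTTPError` propagate makes it easy to compare responses with and without the custom header.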

HTTP 403 error with urllib2.urlopen(URL)
