HTTP Error 403: Forbidden with urllib2 (Python 2.7)

Date: 2016-03-07 00:28:23

Tags: python html http urllib2

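I have been using urllib2 successfully, but it suddenly fails on this website when I test it. I searched the forums and tried several fixes, but none of them worked. Below is an example of a workaround that does not work for me. Can someone help me connect to it?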

Code that produces the error:

from bs4 import BeautifulSoup
import urllib2

proxy_support = urllib2.ProxyHandler({"http":"http://username:password@ip:port"})
hdr = {'Accept': 'text/html,application/xhtml+xml,*/*'}
url = 'http://www.carnextdoor.com.au/'
opener = urllib2.build_opener(proxy_support)
urllib2.install_opener(opener)
req=urllib2.Request(url,headers=hdr)
#Here I get the error with and without using the header or going html = urllib2.urlopen(url).read()
html = urllib2.urlopen(req).read()
soup=BeautifulSoup(html,"html5lib")
print soup

1 answer:

Answer 0 (score: 0)

I got a 403 until I added a user agent; the following was enough for me:

from bs4 import BeautifulSoup
import urllib2

# Same Accept header as in the question, plus a browser-like User-Agent.
hdr = {'Accept': 'text/html,application/xhtml+xml,*/*',
       'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                     '(KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36'}
url = 'http://www.carnextdoor.com.au/'

req = urllib2.Request(url, headers=hdr)
# With the User-Agent set, this call returned the page instead of raising HTTP Error 403.
html = urllib2.urlopen(req).read()
soup = BeautifulSoup(html, "html5lib")
print soup
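
The fix above can be wrapped in a small helper so that every request carries the header and a 403 is reported instead of crashing the script. This is only a sketch: the names `build_request`, `fetch`, and `BROWSER_UA` are made up for illustration, and the `try`/`except` import is an assumption added so the same code also runs on Python 3, where `urllib2` became `urllib.request`.

```python
try:
    # Python 2.7, as used in the question
    import urllib2 as urlrequest
    from urllib2 import HTTPError
except ImportError:
    # Python 3 equivalent (assumption: not part of the original post)
    import urllib.request as urlrequest
    from urllib.error import HTTPError

# Browser-like User-Agent string taken from the answer above.
BROWSER_UA = ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36")


def build_request(url, user_agent=BROWSER_UA):
    """Return a Request that carries the Accept and User-Agent headers."""
    headers = {
        "Accept": "text/html,application/xhtml+xml,*/*",
        "User-Agent": user_agent,
    }
    return urlrequest.Request(url, headers=headers)


def fetch(url):
    """Return the response body, or None if the server answers with an HTTP error."""
    try:
        return urlrequest.urlopen(build_request(url)).read()
    except HTTPError as err:
        # err.code holds the status code, e.g. 403
        print("HTTP %d for %s" % (err.code, url))
        return None
```

Calling `fetch('http://www.carnextdoor.com.au/')` would then either return the HTML or print the status code; whether the site still accepts this particular User-Agent string is not something the sketch can guarantee.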

Without a user agent:

In [10]: hdr = {'Accept': 'text/html,application/xhtml+xml,*/*'}

In [11]: url = 'http://www.carnextdoor.com.au/'

In [12]: req=urllib2.Request(url,headers=hdr)

In [13]: html = urllib2.urlopen(req).read()
---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
<ipython-input-13-dbeb64d95cd3> in <module>()
----> 1 html = urllib2.urlopen(req).read()

With a user agent:

In [20]: hdr = {'Accept': 'text/html,application/xhtml+xml,*/*',"user-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36"}

In [21]: req=urllib2.Request(url,headers=hdr)
In [22]: html = urllib2.urlopen(req).read()
In [23]: 

requests also works fine without setting a user agent explicitly (it sends its own default User-Agent header):

In [28]: import requests

In [29]: r = requests.get(url)

In [30]: r.status_code
Out[30]: 200
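
One way to see why the header matters: urllib2 does not send an empty User-Agent when none is given, it sends its own default. The sketch below prints that default. The helper name `default_user_agent` is made up for illustration, the Python 3 fallback import is an added assumption, and the post itself does not confirm that the site rejects this exact string; it only shows that a browser-like value and requests both got through.

```python
try:
    import urllib2 as urlrequest          # Python 2.7, as in the question
except ImportError:
    import urllib.request as urlrequest   # Python 3 equivalent (assumption)


def default_user_agent():
    """Return the User-Agent that a plain opener sends when none is set."""
    opener = urlrequest.build_opener()
    # addheaders is a list of (name, value) pairs applied to every request
    return dict(opener.addheaders).get("User-agent")


# Prints a string of the form "Python-urllib/<version>"
print(default_user_agent())
```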