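Web crawler HTTP Error 403: Forbidden

Time: 2012-12-21 09:39:26

Tags: python web-crawler http-error

I am a beginner trying to write a web spider script. I want to go to a page, enter data in a text box, click the submit button to go to the next page, retrieve all the data on the new page, and repeat this iteratively.

Here is the code I am trying: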

import urllib
import urllib2
import string
import sys
from BeautifulSoup import BeautifulSoup

hdr = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'none',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
}
values = {'query' : '5ed10c844ed4266a18d34e2ba06b381a' }
data = urllib.urlencode(values)
request = urllib2.Request("https://www.virustotal.com/#search", data, headers=hdr)
response = urllib2.urlopen(request)
the_page = response.read()
pool = BeautifulSoup(the_page)

print pool

Here is the error:

Traceback (most recent call last):
  File "C:\Users\Dipanshu\Desktop\webscraping_demo.py", line 19, in <module>
    response = urllib2.urlopen(request)
  File "C:\Python27\lib\urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "C:\Python27\lib\urllib2.py", line 406, in open
    response = meth(req, response)
  File "C:\Python27\lib\urllib2.py", line 519, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python27\lib\urllib2.py", line 444, in error
    return self._call_chain(*args)
  File "C:\Python27\lib\urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "C:\Python27\lib\urllib2.py", line 527, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 403: Forbidden

How can I fix this problem?

2 answers:

Answer 0 (score: 1)

target_url: the Google search results page for "cat"

The "headers" will help you get past the "Forbidden" error. This code
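The code this answer referred to did not survive, so what follows is only a minimal sketch of the general idea, not the answerer's original. It uses Python 3's urllib.request (the successor to the urllib2 module in the question) to attach browser-like headers to a request. The helper names build_request and fetch are hypothetical, and the URL and form field are simply copied from the question. Note that the question's code already sends similar headers and still receives a 403, so headers alone may not be sufficient for every site.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Header values copied from the question; any realistic browser string
# could be substituted here.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 "
                  "(KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}


def build_request(url, form_values=None, headers=BROWSER_HEADERS):
    """Return a Request carrying browser-like headers (hypothetical helper).

    If form_values is given, it is URL-encoded and attached as a POST body.
    """
    data = urlencode(form_values).encode("ascii") if form_values else None
    return Request(url, data=data, headers=dict(headers))


def fetch(request, timeout=10):
    """Send the request and return the response body as bytes."""
    with urlopen(request, timeout=timeout) as response:
        return response.read()


# Build (but do not send) a request like the one in the question.
request = build_request(
    "https://www.virustotal.com/",
    {"query": "5ed10c844ed4266a18d34e2ba06b381a"},
)
print(request.get_method())              # POST, because a body is attached
print(request.get_header("User-agent"))  # urllib stores header names capitalized this way
```

Calling fetch(request) would actually send it. If the server still answers 403, the site may be rejecting automated clients regardless of headers, in which case an official API, if the site offers one, is usually the more reliable route.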

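Answer 1 (score: 0)

As I understand it, your request parameters are not set correctly, and this is (possibly) driving your spider to a page it is not supposed to view.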

Another user had a similar problem and fixed it by modifying the request headers.
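One possible reading of "request parameters not set correctly" (an assumption, since the answer does not say which parameter is meant): the URL in the question ends in #search, and a URL fragment is never sent to the server. The POST therefore goes to the site's root path rather than to a search endpoint. A small sketch in Python 3 makes this visible without sending anything; request_target is a hypothetical helper name.

```python
from urllib.parse import urlsplit
from urllib.request import Request


def request_target(url):
    """Return (host, path_sent_to_server, fragment_kept_locally) for a URL."""
    request = Request(url)
    return request.host, request.selector, urlsplit(url).fragment


host, selector, fragment = request_target("https://www.virustotal.com/#search")
print(host)      # www.virustotal.com
print(selector)  # /
print(fragment)  # search
```

In a browser, a fragment like this is typically handled by JavaScript on the page, so the form may submit to a different URL than the one shown in the address bar. Inspecting the form's action attribute, or the network traffic in the browser's developer tools, shows which URL and field names the request should actually use.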