I am teaching myself Python and came up with a simple web-crawler engine. Here is the code:
def find_next_url(page):
    start_of_url_line = page.find('<a href')
    if start_of_url_line == -1:
        return None, 0
    else:
        start_of_url = page.find('"http', start_of_url_line)
        if start_of_url == -1:
            return None, 0
        else:
            end_of_url = page.find('"', start_of_url + 1)
            one_url = page[start_of_url + 1 : end_of_url]
            return one_url, end_of_url

def get_all_url(page):
    p = []
    while True:
        url, end_pos = find_next_url(page)
        if url:
            p.append(url)
            page = page[end_pos + 1 : ]
        else:
            break
    return p

def union(a, b):
    for e in b:
        if e not in a:
            a.append(e)
    return a

def webcrawl(seed):
    tocrawl = [seed]
    crawled = []
    while True:
        page = tocrawl.pop()
        if page not in crawled:
            import urllib.request
            intpage = urllib.request.urlopen(page).read()
            openpage = str(intpage)
            union(tocrawl, get_all_url(openpage))
            crawled.append(page)
    return crawled
But I keep getting an HTTP 403 error.
Answer 0 (score: 1)
The HTTP 403 error has nothing to do with your code. It means the URL you requested is forbidden; most of the time that means the page is only available to logged-in or otherwise specific users.

I actually ran your code and got a 403 on a creativecommons link. The reason is that by default urllib does not send a Host header, and you should add it manually to avoid the error (most servers check the Host header to decide what content to serve). You could also use the Requests python package instead of the built-in urllib; it sends the Host header by default and is more pythonic, IMO.
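If you go the Requests route, a minimal sketch of fetching one page might look like the following; the seed URL is only a placeholder, and get_all_url is the function from the question:

import requests  # third-party package: pip install requests

response = requests.get('http://example.com')  # placeholder URL
response.raise_for_status()   # raises requests.HTTPError on 4xx/5xx responses
openpage = response.text      # already decoded to str, no str() conversion needed
print(get_all_url(openpage))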
I added a try-except clause to catch and log the error and then keep crawling the remaining links; there are plenty of broken links out on the web.
from urllib.request import urlopen
from urllib.error import HTTPError

...

def webcrawl(seed):
    tocrawl = [seed]
    crawled = []
    while True:
        page = tocrawl.pop()
        if page not in crawled:
            try:
                intpage = urlopen(page).read()
                openpage = str(intpage)
                union(tocrawl, get_all_url(openpage))
                crawled.append(page)
            except HTTPError as ex:
                print('got http error while crawling', page)
    return crawled
Answer 1 (score: 1)
You may need to add request headers or other authentication. Try adding a User-Agent header, which in some cases also helps avoid reCAPTCHA.

Example:

User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36
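One sketch of how to send that header with the standard library, assuming you want it applied to every urlopen call, is to install a global opener (this is just one option; the header value is copied from the example above):

import urllib.request

# Install an opener whose default headers include a browser-like User-Agent,
# so every subsequent urllib.request.urlopen() call sends it automatically.
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent',
                      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/62.0.3202.94 Safari/537.36')]
urllib.request.install_opener(opener)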
Answer 2 (score: 0)
As others have said, the error is not caused by the code itself, but there are a couple of things you may want to try.

Add an exception handler, perhaps skipping problematic pages for now, to make sure the crawler otherwise works as expected:
import sys
import urllib.request
import urllib.error

def webcrawl(seed):
    tocrawl = [seed]
    crawled = []
    while tocrawl:  # replace `while True` with an actual condition,
                    # otherwise you'll be stuck in an infinite loop
                    # until you hit an exception
        page = tocrawl.pop()
        if page not in crawled:
            try:
                intpage = urllib.request.urlopen(page).read()
                openpage = str(intpage)
                union(tocrawl, get_all_url(openpage))
                crawled.append(page)
            except urllib.error.HTTPError as e:  # catch an exception
                if e.code == 401:  # check the status code and take action
                    pass  # or anything else you want to do in case of an `Unauthorized` error
                elif e.code == 403:
                    pass  # or anything else you want to do in case of a `Forbidden` error
                elif e.code == 404:
                    pass  # or anything else you want to do in case of a `Not Found` error
                # etc
                else:
                    print('Exception:\n{}'.format(e))  # print an unexpected exception
                    sys.exit(1)  # finish the process with exit code 1 (indicates there was a problem)
    return crawled
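With the loop condition fixed, the function returns once tocrawl is exhausted, so a plain invocation like the following sketch will run to completion (the seed URL is only a placeholder):

if __name__ == '__main__':
    # Placeholder seed URL; replace with the page you want to start from.
    for url in webcrawl('http://example.com'):
        print(url)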
Try adding a User-Agent header to your request. From the urllib.request docs:

This is often used to "spoof" the User-Agent header, which is used by a browser to identify itself – some HTTP servers only allow requests coming from common browsers as opposed to scripts. For example, Mozilla Firefox may identify itself as "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11", while urllib's default user agent string is "Python-urllib/2.6" (on Python 2.6).

So something like this may help get past some of the 403 errors:
headers = {'User-Agent':'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11'}
req = urllib.request.Request(page, headers=headers)
intpage = urllib.request.urlopen(req).read()
openpage = str(intpage)
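Dropped into webcrawl, this snippet replaces the bare urllib.request.urlopen(page).read() and str(intpage) lines inside the try block, so every request goes out with a browser-like User-Agent string.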