Simple web crawler in Python

Date: 2017-11-28 13:25:54

Tags: python

I am teaching myself Python and came up with a simple web crawler engine. Here is the code:

def find_next_url(page):
    start_of_url_line = page.find('<a href')
    if start_of_url_line == -1:
        return None, 0
    else:
        start_of_url = page.find('"http', start_of_url_line)
        if start_of_url == -1:
            return None, 0
        else:
            end_of_url = page.find('"', start_of_url + 1)
            one_url = page[start_of_url + 1 : end_of_url]
            return one_url, end_of_url

def get_all_url(page):
    p = []
    while True:
        url, end_pos = find_next_url(page)
        if url:
            p.append(url)
            page = page[end_pos + 1 : ]
        else:
            break
    return p

def union(a, b):
    for e in b:
        if e not in a:
            a.append(e)
    return a

def webcrawl(seed):
    tocrawl = [seed]
    crawled = []
    while True:
        page = tocrawl.pop()
        if page not in crawled:
            import urllib.request
            intpage = urllib.request.urlopen(page).read()
            openpage = str(intpage)
            union(tocrawl, get_all_url(openpage))
            crawled.append(page)
    return crawled

But I keep getting HTTP 403 errors.

3 Answers:

Answer 0 (score: 1)

The HTTP 403 error has nothing to do with your code. It means that access to the URL you requested is forbidden. Most of the time, that means the page is only available to logged-in or otherwise authorized users.

I actually ran your code and got a 403 on a creativecommons link. The reason is that by default urllib does not send the Host header; you should add it manually to avoid the error (most servers check the Host header to decide what content to serve). You can also use the Requests python package instead of the built-in urllib; it sends the Host header by default and is more pythonic IMO.
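
For illustration, here is a minimal sketch of both options; the URL below is just a placeholder, not one of the links your crawler actually hits:

from urllib.request import Request, urlopen
from urllib.parse import urlparse

import requests  # third-party package: pip install requests

url = 'http://example.com/'  # placeholder URL for illustration

# Option 1: keep the built-in urllib, but set the Host header explicitly
req = Request(url, headers={'Host': urlparse(url).netloc})
page = str(urlopen(req).read())

# Option 2: let Requests fill in the standard headers for you
page = requests.get(url).text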

I added a try-except clause to catch and log the error and then continue crawling the other links. There are plenty of broken links on the web.

from urllib.request import urlopen
from urllib.error import HTTPError
...
def webcrawl(seed):
    tocrawl = [seed]
    crawled = []
    while True:
        page = tocrawl.pop()
        if page not in crawled:
            try:
                intpage = urlopen(page).read()
                openpage = str(intpage)
                union(tocrawl, get_all_url(openpage))
                crawled.append(page)
            except HTTPError as ex:
                print('got http error while crawling', page)
    return crawled

Answer 1 (score: 1)

You may need to add request headers or other authentication. Try adding a User-Agent to avoid reCaptcha in some cases.

Example:

    User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36
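
For example, a minimal sketch of attaching such a header with the built-in urllib (the URL below is just a placeholder):

    import urllib.request

    url = 'http://example.com'  # placeholder URL for illustration
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/62.0.3202.94 Safari/537.36'}
    req = urllib.request.Request(url, headers=headers)
    page = str(urllib.request.urlopen(req).read())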

Answer 2 (score: 0)

As others have said, the error is not caused by the code itself, but there are a couple of things you might want to try:

  • Try adding an exception handler; you can even ignore the problematic pages for now, just to make sure the crawler works as expected:

    import sys  # needed for sys.exit below

    def webcrawl(seed):
        tocrawl = [seed]
        crawled = []
        while tocrawl: # replace `while True` with an actual condition,
                       # otherwise you'll be stuck in an infinite loop
                       # until you hit an exception
            page = tocrawl.pop()
            if page not in crawled:
                import urllib.request
                try:
                    intpage = urllib.request.urlopen(page).read()
                    openpage = str(intpage)
                    union(tocrawl, get_all_url(openpage))
                    crawled.append(page)
                except urllib.error.HTTPError as e:  # catch an exception
                    if e.code == 401:  # check the status code and take action
                        pass  # or anything else you want to do in case of an `Unauthorized` error
                    elif e.code == 403:
                        pass  # or anything else you want to do in case of a `Forbidden` error
                    elif e.code == 404:
                        pass   # or anything else you want to do in case of a `Not Found` error
                    # etc
                    else:
                        print('Exception:\n{}'.format(e))  # print an unexpected exception
                        sys.exit(1)  # finish the process with exit code 1 (indicates there was a problem)
        return crawled
    
  • Try adding a User-Agent header to your request. From the urllib.request docs:

  

    This is often used to "spoof" the User-Agent header, which is used by a
    browser to identify itself; some HTTP servers only allow requests coming
    from common browsers as opposed to scripts. For example, Mozilla Firefox
    may identify itself as "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11",
    while urllib's default user agent string is "Python-urllib/2.6" (on Python 2.6).

So something like this might help with some of the 403 errors:

    headers = {'User-Agent':'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11'}
    req = urllib.request.Request(page, headers=headers)
    intpage = urllib.request.urlopen(req).read()
    openpage = str(intpage)