Simple web crawler in Python

Date: 2017-11-28 13:25:54

Tags: python

I am teaching myself Python and came up with a simple web crawler engine. Here is the code:

def find_next_url(page):
    start_of_url_line = page.find('<a href')
    if start_of_url_line == -1:
        return None, 0
    else:
        start_of_url = page.find('"http', start_of_url_line)
        if start_of_url == -1:
            return None, 0
        else:
            end_of_url = page.find('"', start_of_url + 1)
            one_url = page[start_of_url + 1 : end_of_url]
            return one_url, end_of_url

def get_all_url(page):
    p = []
    while True:
        url, end_pos = find_next_url(page)
        if url:
            p.append(url)
            page = page[end_pos + 1 : ]
        else:
            break
    return p

def union(a, b):
    for e in b:
        if e not in a:
            a.append(e)
    return a

def webcrawl(seed):
    tocrawl = [seed]
    crawled = []
    while True:
        page = tocrawl.pop()
        if page not in crawled:
            import urllib.request
            intpage = urllib.request.urlopen(page).read()
            openpage = str(intpage)
            union(tocrawl, get_all_url(openpage))
            crawled.append(page)
    return crawled

But I keep getting HTTP 403 errors.

3 Answers:

Answer 0 (score: 1)

The HTTP 403 error has nothing to do with your code. It means that access to the URL you requested is forbidden. Most of the time, that means the page is only available to logged-in or otherwise authorized users.

I actually ran your code and got a 403 on a creativecommons link. The reason is that by default urllib does not send the Host header; you should add it manually to avoid the error (most servers check the Host header to decide what content to serve). You can also use the Requests python package instead of the built-in urllib; it sends the Host header by default and is more pythonic IMO.
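
For illustration, here is a minimal sketch of both options; the URL below is just a placeholder, not one of the links your crawler actually hits:

from urllib.request import Request, urlopen
from urllib.parse import urlparse

import requests  # third-party package: pip install requests

url = 'http://example.com/'  # placeholder URL for illustration

# Option 1: keep the built-in urllib, but set the Host header explicitly
req = Request(url, headers={'Host': urlparse(url).netloc})
page = str(urlopen(req).read())

# Option 2: let Requests fill in the standard headers for you
page = requests.get(url).text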

I added a try-except clause to catch and log the error and then continue crawling the other links. There are plenty of broken links on the web.

from urllib.request import urlopen
from urllib.error import HTTPError
...
def webcrawl(seed):
    tocrawl = [seed]
    crawled = []
    while True:
        page = tocrawl.pop()
        if page not in crawled:
            try:
                intpage = urlopen(page).read()
                openpage = str(intpage)
                union(tocrawl, get_all_url(openpage))
                crawled.append(page)
            except HTTPError as ex:
                print('got http error while crawling', page)
    return crawled

Answer 1 (score: 1)

You may need to add request headers or other authentication. Try adding a User-Agent to avoid reCaptcha in some cases.

Example:

    User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36
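
For example, a minimal sketch of attaching such a header with the built-in urllib (the URL below is just a placeholder):

    import urllib.request

    url = 'http://example.com'  # placeholder URL for illustration
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/62.0.3202.94 Safari/537.36'}
    req = urllib.request.Request(url, headers=headers)
    page = str(urllib.request.urlopen(req).read())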

Answer 2 (score: 0)

As others have said, the error is not caused by the code itself, but there are a couple of things you might want to try:

  • Try adding an exception handler; you can even ignore the problematic pages for now, just to make sure the crawler works as expected:

    import sys  # needed for sys.exit below

    def webcrawl(seed):
        tocrawl = [seed]
        crawled = []
        while tocrawl: # replace `while True` with an actual condition,
                       # otherwise you'll be stuck in an infinite loop
                       # until you hit an exception
            page = tocrawl.pop()
            if page not in crawled:
                import urllib.request
                try:
                    intpage = urllib.request.urlopen(page).read()
                    openpage = str(intpage)
                    union(tocrawl, get_all_url(openpage))
                    crawled.append(page)
                except urllib.error.HTTPError as e:  # catch an exception
                    if e.code == 401:  # check the status code and take action
                        pass  # or anything else you want to do in case of an `Unauthorized` error
                    elif e.code == 403:
                        pass  # or anything else you want to do in case of a `Forbidden` error
                    elif e.code == 404:
                        pass   # or anything else you want to do in case of a `Not Found` error
                    # etc
                    else:
                        print('Exception:\n{}'.format(e))  # print an unexpected exception
                        sys.exit(1)  # finish the process with exit code 1 (indicates there was a problem)
        return crawled
    
  • Try adding a User-Agent header to your request. From the urllib.request docs:

  

    This is often used to "spoof" the User-Agent header, which is used by a
    browser to identify itself; some HTTP servers only allow requests coming
    from common browsers as opposed to scripts. For example, Mozilla Firefox
    may identify itself as "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11",
    while urllib's default user agent string is "Python-urllib/2.6" (on Python 2.6).

So something like this might help with some of the 403 errors:

    headers = {'User-Agent':'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11'}
    req = urllib.request.Request(page, headers=headers)
    intpage = urllib.request.urlopen(req).read()
    openpage = str(intpage)