Making my web crawler

Time: 2015-02-15 23:59:59

Tags: python web-crawler pycharm

I am trying to build a web crawler that uses a sorting algorithm to demonstrate the basic idea of page ranking, but unfortunately it doesn't work and gives me errors that make no sense to me. Here is the error:

Traceback (most recent call last):
  File "C:/Users/Janis/Desktop/WebCrawler/Web_crawler.py", line 88, in <module>
    webpages()
  File "C:/Users/Janis/Desktop/WebCrawler/Web_crawler.py", line 17, in webpages
    get_single_item_data(href)
  File "C:/Users/Janis/Desktop/WebCrawler/Web_crawler.py", line 21, in get_single_item_data
    source_code = requests.get(item_url)
  File "C:\Python34\lib\site-packages\requests\api.py", line 65, in get
    return request('get', url, **kwargs)
  File "C:\Python34\lib\site-packages\requests\api.py", line 49, in request
    response = session.request(method=method, url=url, **kwargs)
  File "C:\Python34\lib\site-packages\requests\sessions.py", line 447, in request
    prep = self.prepare_request(req)
  File "C:\Python34\lib\site-packages\requests\sessions.py", line 378, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "C:\Python34\lib\site-packages\requests\models.py", line 303, in prepare
    self.prepare_url(url, params)
  File "C:\Python34\lib\site-packages\requests\models.py", line 360, in prepare_url
    "Perhaps you meant http://{0}?".format(url))
requests.exceptions.MissingSchema: Invalid URL '//www.hm.com/gb/logout': No schema supplied. Perhaps you meant http:////www.hm.com/gb/logout?

If I change the line:

for link in soup.findAll ('a'):

to:

, {'class':' '}

it works, but my task is to crawl other web pages, and in that case it does not work.
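
For reference, the changed line presumably reads as follows in full; the {'class': ' '} filter restricts findAll to anchors with that exact class attribute, which apparently skips the schema-less links that trigger the error on this particular site:

for link in soup.findAll('a', {'class': ' '}):
    href = link.get('href')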

Here is my code:

import requests
from bs4 import BeautifulSoup
from collections import defaultdict
from operator import itemgetter

all_links = defaultdict(int)

def webpages():

        url = 'http://www.hm.com/gb/department/HOME'
        source_code = requests.get(url)
        text = source_code.text
        soup = BeautifulSoup(text)
        for link in soup.findAll ('a'):
            href = link.get('href')
            print(href)
            get_single_item_data(href)
        return all_links

def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    text = source_code.text
    soup = BeautifulSoup(text)
    for link in soup.findAll('a'):
        href = link.get('href')
        if href and href.startswith('http://www.'):
            if href:
                all_links[href] += 1
            print(href)


def sort_algorithm(list):
    for index in range(1,len(list)):
        value= list[index]
        i = index - 1
        while i>=0:
            if value < list[i]:
                list[i+1] = list[i]
                list[i] = value
                i=i -1
            else:
                break

vieni = ["", "viens", "divi", "tris", "cetri", "pieci",
         "sesi", "septini", "astoni", "devini"]
padsmiti = ["", "vienpadsmit", "divpadsmit", "trispadsmit", "cetrpadsmit",
         "piecpadsmit", 'sespadsmit', "septinpadsmit", "astonpadsmit", "devinpadsmit"]
desmiti = ["", "desmit", "divdesmit", "trisdesmit", "cetrdesmit",
        "piecdesmit", "sesdesmit", "septindesmit", "astondesmit", "devindesmit"]



def num_to_words(n):
    words = []
    if n == 0:
        words.append("zero")
    else:
        num_str = "{}".format(n)
        groups = (len(num_str) + 2) // 3
        num_str = num_str.zfill(groups * 3)
        for i in range(0, groups * 3, 3):
            h = int(num_str[i])
            t = int(num_str[i + 1])
            u = int(num_str[i + 2])
            print()
            print(vieni[i])
            g = groups - (i // 3 + 1)
            if h >= 1:
                words.append(vieni[h])
                words.append("hundred")
                if int(num_str) % 100:
                    words.append("and")
            if t > 1:
                words.append(desmiti[t])
                if u >= 1:
                    words.append(vieni[u])
            elif t == 1:
                if u >= 1:
                    words.append(padsmiti[u])
                else:
                    words.append(desmiti[t])
            else:
                if u >= 1:
                    words.append(vieni[u])

    return " ".join(words)

webpages()

for k, v in sorted(webpages().items(),key=itemgetter(1),reverse=True):
    print(k, num_to_words(v))

2 Answers:

Answer 0 (score: 0)

You just need to make sure a protocol is present in the link. The link you scraped from the anchor tag is most likely schema-less (there is no http or https in front of the //). This is commonly used to tell the browser to reuse the protocol that was used to retrieve the current page (if you went to http://a.foo.com, every link written as //a.foo.com/bar will use http, and vice versa if you originally reached the page via https). To fix this, test the hrefs you retrieve for a schema, and if it is missing, prepend the appropriate one (you can pass in the protocol that was used to retrieve the original page).

For example, you could change get_single_item_data() to:

def get_single_item_data(item_url):
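    # Prepend a schema to schema-less (protocol-relative) links such as //www.hm.com/gb/logout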
    if not item_url.startswith('http:'):
        item_url = 'http:' + item_url
    source_code = requests.get(item_url)
    text = source_code.text
    soup = BeautifulSoup(text)
    for link in soup.findAll('a'):
        href = link.get('href')
        if href and href.startswith('http://www.'):
            if href:
                all_links[href] += 1
            print(href)

Of course this won't fix everything (for instance, if no // was supplied at all, or if the link is actually relative to the page, or points to an ID on it, starting with # and so on), but it may be a good start. Cleaning up web pages is neither fun nor trivial :(. (Also note that hard-coding values, as I did with http:, is bad practice, but I think it is fair to assume the incoming URLs start with http.)
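
A more general way to handle those remaining cases (a sketch, not part of the answer above; it assumes you also know the URL of the page each href was scraped from) is to resolve every href against its source page with urllib.parse.urljoin, which normalizes schema-less, relative and absolute links alike:

from urllib.parse import urljoin, urlparse

def normalize_href(page_url, href):
    # normalize_href is a hypothetical helper, not part of the original code.
    # Skip empty hrefs and same-page fragments like '#top'.
    if not href or href.startswith('#'):
        return None
    # urljoin resolves '//host/path', '/path' and 'path' against page_url,
    # and leaves absolute http(s) URLs unchanged.
    absolute = urljoin(page_url, href)
    # Drop non-HTTP links such as mailto: or javascript:.
    if urlparse(absolute).scheme not in ('http', 'https'):
        return None
    return absolute

get_single_item_data() could then call normalize_href(item_url, href) on each link before requesting it, instead of filtering on startswith('http://www.').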

Answer 1 (score: 0)

My advice is not to do page ranking and web crawling at the same time; you are just asking for trouble. When you account for page rank you have to consider every page you have already crawled, and if you do that every time you crawl a new page, the number of times you end up considering each page grows with the factorial of the pages crawled (!NoPagesCrawled). You are better off crawling a few thousand pages (or however many) first and then sorting them out afterwards. Just a thought :)
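
As a rough sketch of that crawl-first, rank-later split (the names crawl, rank and extract_links below are illustrative; extract_links stands in for the requests + BeautifulSoup loop from the question):

from collections import defaultdict

def crawl(start_url, max_pages=1000):
    # Crawl up to max_pages pages, only counting links as we go.
    counts = defaultdict(int)
    to_visit = [start_url]
    visited = set()
    while to_visit and len(visited) < max_pages:
        url = to_visit.pop()
        if url in visited:
            continue
        visited.add(url)
        for href in extract_links(url):  # e.g. the BeautifulSoup loop above
            counts[href] += 1
            to_visit.append(href)
    return counts

def rank(counts):
    # One sort at the very end, instead of re-ranking after every page.
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)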