I am trying to make a web crawler with a sorting algorithm that demonstrates the basic idea of page ranking, but unfortunately it does not work and gives me errors that make no sense to me. Here is the error:
Traceback (most recent call last):
  File "C:/Users/Janis/Desktop/WebCrawler/Web_crawler.py", line 88, in <module>
    webpages()
  File "C:/Users/Janis/Desktop/WebCrawler/Web_crawler.py", line 17, in webpages
    get_single_item_data(href)
  File "C:/Users/Janis/Desktop/WebCrawler/Web_crawler.py", line 21, in get_single_item_data
    source_code = requests.get(item_url)
  File "C:\Python34\lib\site-packages\requests\api.py", line 65, in get
    return request('get', url, **kwargs)
  File "C:\Python34\lib\site-packages\requests\api.py", line 49, in request
    response = session.request(method=method, url=url, **kwargs)
  File "C:\Python34\lib\site-packages\requests\sessions.py", line 447, in request
    prep = self.prepare_request(req)
  File "C:\Python34\lib\site-packages\requests\sessions.py", line 378, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "C:\Python34\lib\site-packages\requests\models.py", line 303, in prepare
    self.prepare_url(url, params)
  File "C:\Python34\lib\site-packages\requests\models.py", line 360, in prepare_url
    "Perhaps you meant http://{0}?".format(url))
requests.exceptions.MissingSchema: Invalid URL '//www.hm.com/gb/logout': No schema supplied. Perhaps you meant http:////www.hm.com/gb/logout?
If I change the line:

    for link in soup.findAll('a'):

to:

    for link in soup.findAll('a', {'class': ' '}):

it works, but my task is to crawl other web pages, and in that case it does not work.
Here is my code:
import requests
from bs4 import BeautifulSoup
from collections import defaultdict
from operator import itemgetter

all_links = defaultdict(int)

def webpages():
    url = 'http://www.hm.com/gb/department/HOME'
    source_code = requests.get(url)
    text = source_code.text
    soup = BeautifulSoup(text)
    for link in soup.findAll('a'):
        href = link.get('href')
        print(href)
        get_single_item_data(href)
    return all_links

def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    text = source_code.text
    soup = BeautifulSoup(text)
    for link in soup.findAll('a'):
        href = link.get('href')
        if href and href.startswith('http://www.'):
            if href:
                all_links[href] += 1
                print(href)

def sort_algorithm(list):
    for index in range(1, len(list)):
        value = list[index]
        i = index - 1
        while i >= 0:
            if value < list[i]:
                list[i + 1] = list[i]
                list[i] = value
                i = i - 1
            else:
                break

vieni = ["", "viens", "divi", "tris", "cetri", "pieci",
         "sesi", "septini", "astoni", "devini"]
padsmiti = ["", "vienpadsmit", "divpadsmit", "trispadsmit", "cetrpadsmit",
            "piecpadsmit", "sespadsmit", "septinpadsmit", "astonpadsmit", "devinpadsmit"]
desmiti = ["", "desmit", "divdesmit", "trisdesmit", "cetrdesmit",
           "piecdesmit", "sesdesmit", "septindesmit", "astondesmit", "devindesmit"]

def num_to_words(n):
    words = []
    if n == 0:
        words.append("zero")
    else:
        num_str = "{}".format(n)
        groups = (len(num_str) + 2) // 3
        num_str = num_str.zfill(groups * 3)
        for i in range(0, groups * 3, 3):
            h = int(num_str[i])
            t = int(num_str[i + 1])
            u = int(num_str[i + 2])
            print()
            print(vieni[i])
            g = groups - (i // 3 + 1)
            if h >= 1:
                words.append(vieni[h])
                words.append("hundred")
                if int(num_str) % 100:
                    words.append("and")
            if t > 1:
                words.append(desmiti[t])
                if u >= 1:
                    words.append(vieni[u])
            elif t == 1:
                if u >= 1:
                    words.append(padsmiti[u])
                else:
                    words.append(desmiti[t])
            else:
                if u >= 1:
                    words.append(vieni[u])
    return " ".join(words)

webpages()
for k, v in sorted(webpages().items(), key=itemgetter(1), reverse=True):
    print(k, num_to_words(v))
Answer 0 (score: 0)
You just need to make sure a protocol is present in the link. For example, the links you are scraping out of the anchor tags are most likely schema-less (they start with // and have no http or https in front). This form is commonly used to tell the browser to reuse the protocol that was used to retrieve the current page (i.e. if you went to http://a.foo.com, every link given as //a.foo.com/bar will use http, and vice versa if you originally reached the page over https). To get around this, test the hrefs you retrieve from the http or https page and, if the scheme is missing, prepend the appropriate one (you could pass in the protocol that was used to retrieve the original page).
For example, you could change get_single_item_data() to:
def get_single_item_data(item_url):
    if not item_url.startswith('http:'):
        item_url = 'http:' + item_url
    source_code = requests.get(item_url)
    text = source_code.text
    soup = BeautifulSoup(text)
    for link in soup.findAll('a'):
        href = link.get('href')
        if href and href.startswith('http://www.'):
            if href:
                all_links[href] += 1
                print(href)
Of course this will not solve every case (for instance if no // was supplied at all, or if the link is actually relative to the page or points at an anchor on it, starting with #, and so on), but it may be a good start - sanitising web pages is neither fun nor trivial :(. (Also note that hard-coded values like the http: I used are bad practice, but I think it is fair to assume the incoming URL starts with http.)
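If you also want to cope with ordinary relative links (e.g. /gb/logout) and not just the protocol-relative // form, one option is urllib.parse.urljoin, which resolves any href against the URL of the page it was found on and leaves absolute URLs untouched. Below is a rough sketch only, reusing requests, BeautifulSoup and all_links from the question's code; the base_url default is my own illustration, not part of the original code:

from urllib.parse import urljoin

def get_single_item_data(item_url, base_url='http://www.hm.com/gb/department/HOME'):
    # Resolve schema-less (//host/path), relative (/path) and fragment (#...) hrefs
    # against the page they came from instead of hard-coding 'http:'.
    item_url = urljoin(base_url, item_url)
    source_code = requests.get(item_url)
    soup = BeautifulSoup(source_code.text)
    for link in soup.findAll('a'):
        href = link.get('href')
        if href:
            absolute = urljoin(item_url, href)
            if absolute.startswith('http'):
                all_links[absolute] += 1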
Answer 1 (score: 0)
My advice is not to do page ranking and web crawling at the same time; you are just asking for trouble. When you compute the rank you have to consider every page you have crawled so far, and if you do that after every single page you crawl, the number of times you have to look at each page grows with the number of pages crawled (!NoPagesCrawled). You are better off crawling a few thousand pages, or however many you need, and only then sorting them out. Just a thought :)
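A rough sketch of that crawl-first, rank-later split, assuming the same requests/BeautifulSoup setup as the question; the crawl() and rank() helpers below are illustrative, not code from either answer:

from collections import Counter

def crawl(start_urls):
    # Pass 1: visit each page once and only count inbound links; no ranking yet.
    counts = Counter()
    for url in start_urls:
        page = requests.get(url)
        soup = BeautifulSoup(page.text)
        for link in soup.findAll('a'):
            href = link.get('href')
            if href and href.startswith('http'):
                counts[href] += 1
    return counts

def rank(counts):
    # Pass 2: sort the accumulated counts once, after all crawling is done.
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)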