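I am new to Python. I get this error when I try to run a crawler that only scrapes the links in a page. I have Python 2.7 installed and am working on OS X. My crawler goes to a page, tries to find all the links on that page, and stores them in a list. It then crawls each new link and repeats until there are no links left to crawl.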
  File "crawler.py", line 44, in <module>
    print crawl_web("https://en.wikipedia.org/wiki/Devil_May_Cry_4")
  File "crawler.py", line 7, in crawl_web
    union(tocrawl,get_all_links(get_page(page)))
  File "crawler.py", line 19, in get_page
    response = urllib.urlopen(a)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 87, in urlopen
    return opener.open(url)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 213, in open
    return getattr(self, name)(url)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 469, in open_file
    return self.open_local_file(url)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 483, in open_local_file
    raise IOError(e.errno, e.strerror, e.filename)
IOError: [Errno 2] No such file or directory: '/w/load.phpdebug=false&lang=en&modules=ext.cite.styles|ext.gadget.DRN-wizard,ReferenceTooltips,charinsert,featured-articleslinks,refToolbar,switcher,teahouse|ext.wikimediaBadges&only=styles&skin=vector'
This is the code I am running:
def crawl_web(page):
    tocrawl = [page]
    crawled = []
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            union(tocrawl, get_all_links(get_page(page)))
            crawled.append(page)
    return crawled

def union(a, b):
    for e in b:
        if e not in a:
            a.append(e)

import urllib

def get_page(a):
    response = urllib.urlopen(a)
    data = response.read()
    return data

def get_all_links(page):
    links = []
    while True:
        url, endpos = get_next_target(page)
        if url:
            links.append(url)
            page = page[endpos:]
        else:
            break
    return links

def get_next_target(page):
    start_link = page.find('href=')
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return url, end_quote

print crawl_web("https://en.wikipedia.org/wiki/Devil_May_Cry_4")
Answer 0 (score: 0)
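Your implementation has a flaw. Links in HTML can be relative, for example /index.php, or protocol-relative, for example //en.wikipedia.org/index.php, so you have to detect such links and prepend the protocol and host before fetching them. In Python 2.7, urllib.urlopen treats a string with no scheme as a local file path, which is why the traceback ends in open_local_file with "No such file or directory".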
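One possible way to do this is a minimal sketch using the standard library's urljoin, which resolves a relative or protocol-relative link against the URL of the page it was found on. The helper name resolve_links and the decision to keep only http(s) links are assumptions made for illustration; they are not part of the original code.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2.7, as used in the question


def resolve_links(base_url, links):
    """Hypothetical helper: turn the hrefs found on base_url into absolute URLs."""
    absolute = []
    for link in links:
        # urljoin leaves absolute URLs untouched and resolves relative
        # and protocol-relative ones against base_url.
        full = urljoin(base_url, link)
        # Assumption: the crawler only wants http(s) pages, so schemes
        # such as mailto: or javascript: are skipped.
        if full.startswith(("http://", "https://")):
            absolute.append(full)
    return absolute


base = "https://en.wikipedia.org/wiki/Devil_May_Cry_4"
print(resolve_links(base, ["/index.php", "//en.wikipedia.org/index.php"]))
```

In the crawler from the question, this could be applied to the result of get_all_links before it is passed to union, for example union(tocrawl, resolve_links(page, get_all_links(get_page(page)))).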