from urllib.request import urlopen

class Crawler1(object):
    def __init__(self):
        'constructor'
        self.visited = []
        self.will_visit = []

    def reset(self):
        'reset the visited links'
        self.visited = []
        self.will_visit = []

    def crawl(self, url, n):
        'crawl to depth n starting at url'
        self.analyze(url)
        if n < 0:
            self.reset()
        elif url in self.visted:
            self.crawl(self.will_visit[-1], n-1)
        else:
            self.visited.append(url)
            self.analyze(url)
            self.visited.append(url)
            self.will_visit.pop(-1)
            self.crawl(self.will_visit[-1], n-1)

    def analyze(self, url):
        'returns the list of URLs found in the page url'
        print("Visiting", url)
        content = urlopen(url).read().decode()
        collector = Collector(url)
        collector.feed(content)
        urls = collector.getLinks()
        for i in urls:
            if i in self.will_visit:
                pass
            else:
                self.will_visit.append(i)
I want this program to follow a chain of links, but only as deep as 'n' allows. I'm not sure what's wrong with the code, but I'm sure there is plenty. Some hints would be nice.

Expected output if n = 1, and Site1 has links to Site2 and Site3:
Visiting [Site1]
Visiting [Site2]
Visiting [Site3]
Answer (score 2):
You need to think carefully about how the crawler should behave, and in particular about how it decides to move on to another page. The logic centers on the crawl method:

1. If n < 0, you have already crawled deep enough and don't want to do anything, so simply return in that case.
2. Otherwise, analyze the page, then crawl each new URL it yields with depth n-1.
I think part of the confusion is that you keep a queue of URLs to visit, but also crawl recursively. For one thing, it means the queue contains not only the children of the most recently crawled URL, which you want to visit in order, but also the children of other nodes that were crawled but never fully processed. It is hard to keep a depth-first search in shape that way.
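For comparison, a crawler driven only by a worklist would drop the recursion and instead carry the remaining depth alongside each URL. A minimal sketch, assuming analyze returns a list of links as suggested below (crawl_iterative is just an illustrative name, not part of the original code):

from collections import deque

def crawl_iterative(self, url, n):
    'breadth-first crawl to depth n, tracking depth per URL'
    queue = deque([(url, n)])
    while queue:
        u, depth = queue.popleft()
        if depth < 0 or u in self.visited:
            continue  # too deep, or already seen
        self.visited.append(u)
        for child in self.analyze(u):
            queue.append((child, depth - 1))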
Instead, I would remove the will_visit variable and have analyze return the list of links it found. Then process that list according to step 2 above, for example:
# Crawl this page and process its links
child_urls = self.analyze(url)
for u in child_urls:
    if u in self.visited:
        continue  # Do nothing, because it's already been visited
    self.crawl(u, n-1)
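Combined with the base case from step 1, the whole crawl method could look like this sketch (where to record visited URLs is an assumption on my part; the steps above leave that bookkeeping implicit):

def crawl(self, url, n):
    'crawl to depth n starting at url'
    if n < 0:
        return  # step 1: deep enough, do nothing
    self.visited.append(url)  # mark before descending
    # step 2: analyze, then crawl each new URL one level deeper
    child_urls = self.analyze(url)
    for u in child_urls:
        if u in self.visited:
            continue
        self.crawl(u, n-1)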
To make this work, you will also need to change analyze to simply return the list of URLs, rather than pushing them onto the will_visit stack:
def analyze(self, url):
    ...
    urls = collector.getLinks()
    return urls
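Putting it all together, here is one self-contained sketch of the revised class. The Collector below is a hypothetical stand-in built on html.parser, since the asker's Collector isn't shown; only its getLinks interface is taken from the question:

from html.parser import HTMLParser
from urllib.request import urlopen
from urllib.parse import urljoin

class Collector(HTMLParser):
    'hypothetical stand-in: gathers href targets from <a> tags'
    def __init__(self, url):
        super().__init__()
        self.url = url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    # resolve relative links against the page URL
                    self.links.append(urljoin(self.url, value))

    def getLinks(self):
        return self.links

class Crawler1(object):
    def __init__(self):
        self.visited = []

    def crawl(self, url, n):
        'crawl to depth n starting at url'
        if n < 0:
            return
        self.visited.append(url)
        for u in self.analyze(url):
            if u not in self.visited:
                self.crawl(u, n-1)

    def analyze(self, url):
        'return the list of URLs found in the page at url'
        print("Visiting", url)
        content = urlopen(url).read().decode()
        collector = Collector(url)
        collector.feed(content)
        return collector.getLinks()

With this, Crawler1().crawl("http://Site1", 1) should print Site1 and then each page it links to, matching the expected output in the question.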