Recursion in a Python web crawler

Date: 2012-09-18 19:59:51

Tags: python

I'm trying to build a small web crawler in Python. What's tripping me up right now is the recursion and depth-tracking part of the problem. Given a url and a maxDepth for how far from that site I want to follow links, I add the url to the set of searched sites, then download all the text and links from that page. For every link contained in the page, I want to crawl it in turn and collect its words and links. The problem is that by the time I recursively call crawl on the next url, the depth has already reached maxDepth, and the crawl stops after going only one page further. Hopefully I've explained it decently; basically, my question is: how do I make all the recursive calls and still handle the self._depth += 1 correctly?

def crawl(self, url, maxDepth):
    self._listOfCrawled.add(url)
    text = crawler_util.textFromURL(url).split()
    for each in text:
        self._index[each] = url
    links = crawler_util.linksFromURL(url)
    if self._depth < maxDepth:
        self._depth = self._depth + 1
        for i in links:
            if i not in self._listOfCrawled:
                self.crawl(i, maxDepth)

1 Answer:

Answer 0 (score: 4)

The problem with your code is that you increase self._depth on every call of the function, and since it is an instance variable, it stays increased across the following calls. Say maxDepth is 3 and you have a url A that links to B and C, B links to D, and C links to E. Your call hierarchy looks like this (assuming self._depth is 0 at the start):

crawl(self, A, 3)          # self._depth set to 1, following links to B and C
    crawl(self, B, 3)      # self._depth set to 2, following link to D
        crawl(self, D, 3)  # self._depth set to 3, no links to follow
    crawl(self, C, 3)      # self._depth >= maxDepth, skipping link to E

In other words, instead of tracking the depth of the current call, you are tracking the accumulated number of calls to crawl.
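To see this concretely, here is a minimal, self-contained sketch reproducing the A/B/C/D/E example above. The `LINKS` dict and `SharedCounterCrawler` class are hypothetical stand-ins for the real crawler; only the depth bookkeeping matches the question's code:

```python
# Hypothetical link graph: A -> B, C; B -> D; C -> E
LINKS = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": [], "E": []}

class SharedCounterCrawler:
    def __init__(self):
        self._depth = 0      # shared across ALL recursive calls
        self.visited = []

    def crawl(self, url, max_depth):
        self.visited.append(url)
        if self._depth < max_depth:
            self._depth += 1  # incremented but never reset on the way back up
            for link in LINKS[url]:
                if link not in self.visited:
                    self.crawl(link, max_depth)

c = SharedCounterCrawler()
c.crawl("A", 3)
# c.visited is ["A", "B", "D", "C"] -- E is skipped, because by the time
# crawl("C", 3) runs, self._depth has already accumulated to 3
```

The counter counts total calls, not distance from the start page, which is exactly why the crawl dies early.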

Instead, try something like this:

def crawl(self, url, depthToGo):
    # call this method with depthToGo set to maxDepth
    self._listOfCrawled.add(url)
    text = crawler_util.textFromURL(url).split()
    for each in text:
        # if the word is not in the index yet, create a new set for it,
        # then add the URL to that set
        if each not in self._index:
            self._index[each] = set()
        self._index[each].add(url)
    links = crawler_util.linksFromURL(url)
    # check if we can go deeper
    if depthToGo > 0:
        for i in links:
            if i not in self._listOfCrawled:
                # decrease depthToGo for the next level of recursion
                self.crawl(i, depthToGo - 1)
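To make the corrected method runnable end to end, here is a self-contained sketch in which `crawler_util` is stubbed with a tiny in-memory "site" (all page names and contents are hypothetical), so the depth behavior can be verified without network access:

```python
# Fake site: url -> (page text, outgoing links). Hypothetical data.
PAGES = {
    "a.html": ("alpha beta", ["b.html", "c.html"]),
    "b.html": ("beta gamma", ["d.html"]),
    "c.html": ("gamma delta", ["e.html"]),
    "d.html": ("delta", []),
    "e.html": ("epsilon", []),
}

class crawler_util:
    """Stub standing in for the real crawler_util module."""
    @staticmethod
    def textFromURL(url):
        return PAGES[url][0]

    @staticmethod
    def linksFromURL(url):
        return PAGES[url][1]

class Crawler:
    def __init__(self):
        self._listOfCrawled = set()
        self._index = {}  # word -> set of URLs containing it

    def crawl(self, url, depthToGo):
        self._listOfCrawled.add(url)
        for word in crawler_util.textFromURL(url).split():
            # setdefault is an idiomatic shortcut for the if-not-in-index check
            self._index.setdefault(word, set()).add(url)
        if depthToGo > 0:
            for link in crawler_util.linksFromURL(url):
                if link not in self._listOfCrawled:
                    # each level of recursion has one less level to go
                    self.crawl(link, depthToGo - 1)

c = Crawler()
c.crawl("a.html", 3)
# All five pages are reached, since depthToGo is passed down as an
# argument rather than accumulated on the instance.
```

Because depthToGo is a parameter, each branch of the recursion carries its own remaining budget, so the crawl from c.html still descends to e.html even after the b.html branch has finished.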