Scrapy recursive link crawler with login - help me improve it

Date: 2016-07-27 17:41:33

Tags: recursion scrapy scrapy-spider

To the best of my current knowledge, I have written a small web spider/crawler that can crawl recursively to a configurable nesting depth and can optionally perform a POST/GET pre-login before crawling (if required).
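
For context, the pre-login step follows the usual Scrapy pattern of sending a FormRequest before the actual crawl starts. A minimal sketch of that pattern, with hypothetical URLs and form field names, not the exact crawlpy code:

import scrapy

class LoginSpider(scrapy.Spider):
    name = 'login-example'
    login_url = 'http://example.com/login'  # hypothetical
    start_url = 'http://example.com/'       # hypothetical

    def start_requests(self):
        # Log in first; crawling only starts from the callback
        yield scrapy.FormRequest(
            self.login_url,
            formdata={'username': 'user', 'password': 'pass'},  # hypothetical fields
            callback=self.after_login,
        )

    def after_login(self, response):
        # Naive success check; a real spider would inspect the response more carefully
        if b'login failed' not in response.body:
            yield scrapy.Request(self.start_url, callback=self.parse)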

Since I am a complete beginner, I would like to get some feedback, improvements, or any other input from you.

I am only including the parse function here. The full source code can be viewed on GitHub: https://github.com/cytopia/crawlpy

What I really want to make sure of is that the recursion combined with yield is as efficient as possible, and that I am going about it the right way.

Any comments on this, as well as on coding style, are very welcome.

def parse(self, response):
    """
    Scrapy parse callback
    """

    # Get current nesting level
    if response.meta.has_key('depth'):
        curr_depth = response.meta['depth']
    else:
        curr_depth = 1


    # Only crawl the current page if we hit a HTTP-200
    if response.status == 200:
        hxs = Selector(response)
        links = hxs.xpath("//a/@href").extract()

        # We stored already crawled links in this list
        crawled_links = []

        # Pattern to check proper link
        linkPattern  = re.compile("^(?:http|https):\/\/(?:[\w\.\-\+]+:{0,1}[\w\.\-\+]*@)?(?:[a-z0-9\-\.]+)(?::[0-9]+)?(?:\/|\/(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+)|\?(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+))?$")

        for link in links:

            # Link could be a relative url from response.url
            # such as link: '../test', respo.url: http://dom.tld/foo/bar
            if link.find('../') == 0:
                link = response.url + '/' + link
            # Prepend BASE URL if it does not have it
            elif 'http://' not in link and 'https://' not in link:
                link = self.base_url + link


            # If it is a proper link and is not checked yet, yield it to the Spider
            if (link
                    and linkPattern.match(link)
                    and link.find(self.base_url) == 0):
                    #and link not in crawled_links
                    #and link not in uniques):

                # Check if this url already exists
                re_exists = re.compile('^' + link + '$')
                exists = False
                for i in self.uniques:
                    if re_exists.match(i):
                        exists = True
                        break

                if not exists:
                    # Store the shit
                    crawled_links.append(link)
                    self.uniques.append(link)

                    # Do we recurse?
                    if curr_depth < self.depth:
                        request = Request(link, self.parse)
                        # Add meta-data about the current recursion depth
                        request.meta['depth'] = curr_depth + 1
                        yield request
                    else:
                        # Nesting level too deep
                        pass
            else:
                # Link not in condition
                pass


        #
        # Final return (yield) to user
        #
        for url in crawled_links:
            #print "FINAL FINAL FINAL URL: " + response.url
            item = CrawlpyItem()
            item['url'] = url
            item['depth'] = curr_depth

            yield item
        #print "FINAL FINAL FINAL URL: " + response.url
        #item = CrawlpyItem()
        #item['url'] = response.url
        #yield item
    else:
        # NOT HTTP 200
        pass

1 Answer:

Answer 0 (score: 2)

Your entire code can be simplified to:

from scrapy import Request
from scrapy.linkextractors import LinkExtractor

def parse(self, response):
    # Get current nesting level (defaults to 1 on the first page)
    curr_depth = response.meta.get('depth', 1)

    # Yield one item per visited page
    item = CrawlpyItem()  # could also just be `item = dict()`
    item['url'] = response.url
    item['depth'] = curr_depth
    yield item

    # Extract and follow every link; no callback is given,
    # so Scrapy falls back to this same parse() method
    links = LinkExtractor().extract_links(response)
    for link in links:
        yield Request(link.url, meta={'depth': curr_depth + 1})
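
One thing this simplified version drops is your explicit depth cap. If that is still wanted, Scrapy's built-in DEPTH_LIMIT setting enforces it without any code in the spider (a sketch, assuming your project's settings.py; the DepthMiddleware behind it also maintains meta['depth'] on its own):

# settings.py -- hypothetical snippet
DEPTH_LIMIT = 3              # drop requests nested deeper than 3 levels
DEPTH_STATS_VERBOSE = True   # optional: log request counts per depth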

If I understand correctly what you are trying to do: broadly crawl every URL and yield the depth and URL as an item?

Scrapy has its duplicate filter enabled by default, so you do not need to implement that logic yourself. Also, your parse() method will never receive anything other than 200 responses, so that check is useless.
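
For reference, the duplicate filtering comes from the default scrapy.dupefilters.RFPDupeFilter, and non-200 responses only reach your callback if you opt in explicitly. A sketch of that opt-in, purely illustrative since it is not needed here:

class CrawlpySpider(scrapy.Spider):
    # Only needed if you want to handle e.g. 404s or 500s yourself;
    # by default Scrapy filters them out before parse() is called.
    handle_httpstatus_list = [404, 500]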

Edit: reworked to avoid duplicates.
