I'm new to Scrapy and I'm trying to build a spider that has to do this kind of job:
recursively extract all links from a generic web page, down to a specific depth.
I tried to do it with the following code:
from scrapy.conf import settings
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from myproject.items import NewsItem  # adjust to wherever NewsItem is defined

# Override the depth limit at module level, before the crawl starts
settings.overrides['DEPTH_LIMIT'] = 1

class MySpider(CrawlSpider):
    name = "cnet"
    allowed_domains = ["cnet.com"]
    start_urls = ["http://www.cnet.com/"]

    # Follow every cnet.com link and pass each fetched page to parse_items
    rules = (
        Rule(SgmlLinkExtractor(allow_domains=('cnet.com',)),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        print ""
        print "PARSE ITEMS"
        print ""
        hxs = HtmlXPathSelector(response)
        anchors = hxs.select('//a')
        items = []
        for anchor in anchors:  # was "for titles in titles", which shadowed the list
            item = NewsItem()
            item["title"] = anchor.select("text()").extract()
            item["link"] = anchor.select("@href").extract()
            if len(item["link"]) > 0 and self.allowed_domains[0] in item["link"][0]:
                print ""
                print response.meta['depth']
                print item["title"]
                print item["link"]
                print ""
                items.append(item)
        return items
But it seems to keep running in an INFINITE loop. Any suggestions?
Thanks a lot!
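
For comparison, here is a minimal sketch of how the same depth-capped crawl is usually written in newer Scrapy releases (this assumes Scrapy >= 1.8, where LinkExtractor, custom_settings and Selector.get() are all available; the spider name, domain and depth of 1 are carried over from the code above):

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class DepthLimitedSpider(CrawlSpider):
        name = "cnet"
        allowed_domains = ["cnet.com"]
        start_urls = ["http://www.cnet.com/"]

        # custom_settings is read by the crawler before the first request,
        # so the cap is guaranteed to be in place. DEPTH_LIMIT = 1 means
        # the start pages plus the pages they link to, nothing deeper.
        custom_settings = {"DEPTH_LIMIT": 1}

        rules = (
            Rule(LinkExtractor(allow_domains=("cnet.com",)),
                 callback="parse_items", follow=True),
        )

        def parse_items(self, response):
            # 'depth' is put on response.meta by Scrapy's DepthMiddleware
            for anchor in response.xpath("//a"):
                yield {
                    "depth": response.meta.get("depth"),
                    "title": anchor.xpath("text()").get(),
                    "link": anchor.xpath("@href").get(),
                }

Note that even with DEPTH_LIMIT = 1 the run is not instant: every page reachable in one hop from the start URL is still fetched, which on a large site can look like an endless crawl in the console.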