Question

As we can see:

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//ul/li')
    items = []

    for site in sites:
        item = Website()
        item['name'] = site.select('a/text()').extract()
        item['url'] = site.select('//a[contains(@href, "http")]/@href').extract()
        item['description'] = site.select('text()').extract()
        items.append(item)

    return items

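Scrapy only fetches a single page response and finds the URLs in that response. I think that is just a surface crawl!

But I want more URLs, down to a defined depth.

What can I do to achieve that?

Thanks!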

Answer 1

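I don't understand your question, but I noticed several problems in your code, some of which may be related to your question (see the comments in the code):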

sites = hxs.select('//ul/li')
items = []

for site in sites:
    item = Website()
    # extract() returns a list, so I guess .extract()[0] is what you expect here
    item['name'] = site.select('a/text()').extract()
    # '//a[...]': maybe you expect this to get the links within `site`, but it actually gets the links from the entire page; you should use './/a[...]'.
    # And, again, this returns a list, not a single URL.
    item['url'] = site.select('//a[contains(@href, "http")]/@href').extract()

Answer 2

Take a look at the documentation on Requests and Responses.

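When you scrape the first page, you collect a set of links that are used to generate a second round of requests, and those lead to a second callback function that scrapes the second level. It sounds complicated in the abstract, but the example code in the documentation shows that it is quite straightforward.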

In addition, the CrawlSpider example is more fleshed out and gives you template code that you may simply want to adapt to your situation.

Hope this gets you started.

Answer 3

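You can crawl more pages by using CrawlSpider, which can be imported from scrapy.contrib.spiders, and defining rules that tell the crawler which kinds of links you want it to follow.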

Follow the instructions in the Scrapy documentation to see how to define rules.

By the way, consider changing the function name. From the docs:

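Warning

When writing crawl spider rules, avoid using parse as the callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.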

How can Scrapy crawl more URLs?
