Question

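I am using Scrapy to try to retrieve URL data nested inside a class. I have gone through the tutorials and similar questions, but I am still falling short on what looks like a simple task.

The page I am trying to scrape is this one: https://github.com/ICanBoogie/CLDR

For each vehicle on the page, I want an XPath that leads to the "data-nice_url" text. So the first result should be "/privatleasing/Citro%c3%abn-Berlingo/eHDi-90-Seduction-E6G". But I get an empty result set every time. I have tried changing the XPath, with no luck.

My code looks like this: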

from scrapy.spiders import Spider
from stack.items import StackItem
from scrapy.selector import Selector


class LeasingCarSpider(Spider):
    name = "leasingcar"
    # allowed_domains takes bare domain names, not URLs
    allowed_domains = ["leasingcar.dk"]
    start_urls = ["http://www.leasingcar.dk/privatleasing"]

    def parse(self, response):
        hxs = Selector(response)
        print(hxs.xpath('//div[@class="data-nice_url"]/text()').extract())

Thanks in advance.

Answer 1

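The page is quite "dynamic" and builds itself from multiple XHR requests to different API endpoints. After looking at those requests in the browser developer tools, I would say that simulating them in your Scrapy code would not be easy, and that it would be much easier to solve the problem with selenium, a browser automation tool. You can also use the "headless" PhantomJS browser or a virtual display.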
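
As a rough illustration of the selenium route, here is a minimal sketch rather than a tested solution for this particular site. It assumes that "data-nice_url" is an HTML attribute on the vehicle elements (not a class value, as the XPath in the question treats it), that the selenium package and Firefox are installed, and that the attribute appears within a few seconds of loading. The names fetch_rendered_html, extract_nice_urls and NiceUrlCollector are made up for this example.

```python
from html.parser import HTMLParser


class NiceUrlCollector(HTMLParser):
    """Collect the value of every data-nice_url attribute in a document."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        for name, value in attrs:
            if name == "data-nice_url" and value:
                self.urls.append(value)


def extract_nice_urls(html):
    """Return all data-nice_url attribute values found in an HTML string."""
    collector = NiceUrlCollector()
    collector.feed(html)
    collector.close()
    return collector.urls


def fetch_rendered_html(url, timeout=10):
    """Load the page in a real browser and return the HTML after scripts ran.

    Assumption: the third-party selenium package and Firefox are available.
    The import is local so the parsing helper above works without selenium.
    """
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # Wait until at least one element carrying the attribute exists.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "[data-nice_url]"))
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()
```

Something like extract_nice_urls(fetch_rendered_html("http://www.leasingcar.dk/privatleasing")) would then return the attribute values in document order, provided the assumptions above hold. If you would rather stay inside Scrapy once you have the rendered HTML, an attribute XPath such as //@data-nice_url on a Selector built from that HTML should do the same job as the small parser here.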

In any case, make sure you are not violating the website's terms of use and that you are a good web-scraping citizen.
