Question

I'm new to Scrapy and I'm wondering why my scraper isn't working. Here is my code:

import scrapy

from tutorial.items import TutorialItem

class tutSpider(scrapy.Spider):
    name = "tutorial"
    allowed_domains = ["backpage.com"]
    start_urls = [
        "http://chicago.backpage.com/FemaleEscorts/naughtiest-_girl-next-door/20557457"
    ]

    def parse(self, response):
        # sel = response.xpath('//*')
        item = TutorialItem()
        item['title'] = response.xpath('//div[@id="postingTitle"]/h1/text()').extract()
        item['link'] = response.xpath('a/@href').extract()
        item['desc'] = response.xpath('//body/div[@id="postingBody"]/text()').extract()
        yield item

It produces the following JSON file:

[{"title": [], "link": [], "desc": []}]

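I'm fairly sure it simply cannot find the elements I specified, even though I'm 100% certain those div IDs are valid. They are nested inside other divs within the body.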

Answer 1

As you guessed, the problem lies in the XPath itself.

The h1 node sits inside an a node that does not appear in the XPath you used. So it must be:

item['title'] = response.xpath('//div[@id="postingTitle"]/a/h1/text()').extract()
item['link'] = response.xpath('//div[@id="postingBody"]/a/@href').extract()
item['desc'] = response.xpath('//div[@id="postingBody"]//text()').extract()

As @Jarrod Roberson pointed out, there are many tools available that generate XPath expressions and validate them.

If you are using Firefox with Firebug, try FirePath. It is always a good idea to test your XPaths before putting them into the spider.
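To illustrate why the missing step matters, here is a minimal sketch that tests both paths against a small piece of markup before any spider is involved. The HTML below is hypothetical and only mimics the structure described above (an h1 wrapped in an a), and the standard library's xml.etree.ElementTree is used as a stand-in that supports only a subset of XPath. In Scrapy itself, the same kind of check can be done interactively with `scrapy shell` and `response.xpath(...)`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the real page:
# the <h1> is wrapped in an <a>, as the answer describes.
SAMPLE = """
<html>
  <body>
    <div id="postingTitle">
      <a href="/some/listing"><h1>Example title</h1></a>
    </div>
    <div id="postingBody">
      Example description
      <a href="/some/link">more</a>
    </div>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE)

# Original path: expects <h1> directly under the div, so nothing matches.
missing = root.findall('.//div[@id="postingTitle"]/h1')

# Corrected path: includes the intermediate <a> element.
found = root.findall('.//div[@id="postingTitle"]/a/h1')

# Links inside the posting body, anchored at the div instead of a bare 'a'.
links = [a.get("href") for a in root.findall('.//div[@id="postingBody"]/a')]

print(len(missing))             # number of matches for the original path
print([h.text for h in found])  # titles matched by the corrected path
print(links)                    # hrefs found under the posting body
```

An empty result for the first path and a match for the second mirrors the empty lists in the JSON output: the expression was valid, it just described a structure the page does not have.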

Scrapy XPath not working
