Scrapy XPath selectors

Date: 2015-07-22 21:51:10

Tags: python xpath scrapy

I am scraping this site with Scrapy, but I have run into an XPath problem and I am not entirely sure what is going on.

Why does this work:

def parse_item(self, response):
    item = BotItem()

    for title in response.xpath('//h1'):
        item['title'] = title.xpath('strong/text()').extract()
        item['wage'] = title.xpath('span[@class="price"]/text()').extract()
        yield item

while the following code does not?

def parse_item(self, response):
    item = BotItem()

    for title in response.xpath('//body'):
        item['title'] = title.xpath('h1/strong/text()').extract()
        item['wage'] = title.xpath('h1/span[@class="price"]/text()').extract()
        yield item

I also want to extract this XPath:

//div[@id="description"]/p

But I can't, because it lies outside the h1 node. How can I do that? My full code is:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from bot.items import BotItem


class MufmufSpider(CrawlSpider):
    name = 'mufmuf'
    allowed_domains = ['mufmuf.ro']
    start_urls = ['http://mufmuf.ro/locuri-de-munca/joburi-in-strainatate/']

    rules = (
        Rule(
            LinkExtractor(restrict_xpaths='//div[@class="paginate"][position() = last()]'),
            #callback='parse_start_url',
            follow=True
        ),
        Rule(
            LinkExtractor(restrict_xpaths='//h3/a'),
            callback='parse_item',
            follow=True
        ),
    )

    def parse_item(self, response):
        item = BotItem()

        for title in response.xpath('//h1'):
            item['title'] = title.xpath('strong/text()').extract()
            item['wage'] = title.xpath('span[@class="price"]/text()').extract()
            #item['description'] = title.xpath('div[@id="descirption"]/p/text()').extract()
            yield item

1 answer:

Answer 0 (score: 4)

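The for title in response.xpath('//body'): variant does not work because the XPath expressions inside the loop are relative to body, so h1/strong/text() looks for an h1 element that is a direct child of the body element. On this page the h1 is evidently nested deeper, so nothing matches; a descendant search such as .//h1/strong/text() would be needed instead.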
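The difference between a child step and a descendant step can be reproduced outside Scrapy. The sketch below is only an illustration: it uses the standard library's xml.etree.ElementTree instead of Scrapy selectors so that it runs without any installation, and the markup in PAGE is a made-up stand-in for the real page, assuming the h1 sits inside a wrapper div. The same child-versus-descendant rule applies to relative XPath expressions in Scrapy.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup (an assumption, not the real page):
# the h1 is nested inside a div, so it is not a direct child of body.
PAGE = """
<html>
  <body>
    <div id="content">
      <h1><strong>Job title</strong><span class="price">1000 EUR</span></h1>
      <div id="description"><p>Job description</p></div>
    </div>
  </body>
</html>
"""

body = ET.fromstring(PAGE).find("body")

# 'h1/strong' is relative to body and only matches an h1
# that is a direct child of body, so nothing is found here.
direct = [el.text for el in body.findall("h1/strong")]

# './/h1/strong' searches all descendants of body.
nested = [el.text for el in body.findall(".//h1/strong")]
wage = [el.text for el in body.findall(".//h1/span[@class='price']")]

# The description is a sibling of h1, so it is reached from body, not from h1.
description = [el.text for el in body.findall(".//div[@id='description']/p")]

print(direct)
print(nested, wage, description)
```

With this stand-in markup, the child-only path comes back empty while the descendant paths find the title, the wage and the description.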

Besides, since there is only one entity to extract per page, you don't need a loop at all:

def parse_item(self, response):
    item = BotItem()

    item["title"] = response.xpath('//h1/strong/text()').extract()
    item["wage"] = response.xpath('//h1/span[@class="price"]/text()').extract()
    item["description"] = response.xpath('//div[@id="description"]/p/text()').extract()

    return item

(This should also answer your second question, about description.)