I am scraping this site using Scrapy, but I have run into an XPath problem and I am not entirely sure what is going on.
Why does this work:
def parse_item(self, response):
    item = BotItem()
    for title in response.xpath('//h1'):
        item['title'] = title.xpath('strong/text()').extract()
        item['wage'] = title.xpath('span[@class="price"]/text()').extract()
        yield item
but the following does not?
def parse_item(self, response):
    item = BotItem()
    for title in response.xpath('//body'):
        item['title'] = title.xpath('h1/strong/text()').extract()
        item['wage'] = title.xpath('h1/span[@class="price"]/text()').extract()
        yield item
My goal is also to extract this XPath:

//div[@id="description"]/p

but I cannot, because it sits outside the h1 node. How can I do that? My full code is:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from bot.items import BotItem


class MufmufSpider(CrawlSpider):
    name = 'mufmuf'
    allowed_domains = ['mufmuf.ro']
    start_urls = ['http://mufmuf.ro/locuri-de-munca/joburi-in-strainatate/']

    rules = (
        Rule(
            LinkExtractor(restrict_xpaths='//div[@class="paginate"][position() = last()]'),
            #callback='parse_start_url',
            follow=True
        ),
        Rule(
            LinkExtractor(restrict_xpaths='//h3/a'),
            callback='parse_item',
            follow=True
        ),
    )

    def parse_item(self, response):
        item = BotItem()
        for title in response.xpath('//h1'):
            item['title'] = title.xpath('strong/text()').extract()
            item['wage'] = title.xpath('span[@class="price"]/text()').extract()
            #item['description'] = title.xpath('div[@id="descirption"]/p/text()').extract()
            yield item
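(For completeness, BotItem is imported from bot.items; a minimal item definition consistent with the fields used above would look like the sketch below. The actual bot/items.py in the project may declare more fields.)

import scrapy


class BotItem(scrapy.Item):
    # Fields referenced by the spider; the real bot/items.py may define more.
    title = scrapy.Field()
    wage = scrapy.Field()
    description = scrapy.Field()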
Answer 0 (score: 4)
The

for title in response.xpath('//body'):

option does not work because the XPath expressions inside the loop make it search for the h1 element directly inside the body element (i.e. as an immediate child), which is apparently not where it sits on this page.
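A quick way to see the difference between a direct-child path and a descendant path is to run both against a small HTML fragment with scrapy.Selector (a minimal sketch; the markup below is made up for illustration):

from scrapy import Selector

# Made-up fragment: the <h1> is nested inside a wrapper <div>,
# so it is NOT a direct child of <body>.
html = '<body><div class="wrap"><h1><strong>Cook</strong></h1></div></body>'
body = Selector(text=html).xpath('//body')[0]

print(body.xpath('h1/strong/text()').extract())     # [] - 'h1' only matches a direct child of body
print(body.xpath('.//h1/strong/text()').extract())  # ['Cook'] - './/h1' matches any descendant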
Besides, since only a single entity needs to be extracted here, there is no need for a loop at all:
def parse_item(self, response):
    item = BotItem()
    item["title"] = response.xpath('//h1/strong/text()').extract()
    item["wage"] = response.xpath('//h1/span[@class="price"]/text()').extract()
    item["description"] = response.xpath('//div[@id="description"]/p/text()').extract()
    return item
(This should also answer the second question, about description.)
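As a side note, if title and wage are expected to be single strings rather than one-element lists, extract_first() (or .get() in newer Scrapy versions) can be used instead of extract(); a minimal variation of the same parse_item:

def parse_item(self, response):
    item = BotItem()
    # extract_first() returns the first matching string (or None) instead of a list
    item["title"] = response.xpath('//h1/strong/text()').extract_first()
    item["wage"] = response.xpath('//h1/span[@class="price"]/text()').extract_first()
    # the description may span several <p> elements, so keep extract() here
    item["description"] = response.xpath('//div[@id="description"]/p/text()').extract()
    return item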