response.xpath().extract() returns a list of strings instead of DOM fragments

Time: 2018-06-08 04:40:28

Tags: python python-2.7 xpath scrapy response

The only reason I'm asking this question is that I prefer using response.xpath over response.css.

Below is a working copy of the script I wrote. It uses response.css, and I would like to translate it to response.xpath.

import scrapy

class RentalsCrawler(scrapy.Spider):
    name = "rentals"
    allowed_domains = [
        'craigslist.org',
        'kajiji.ca'
    ]
    custom_settings = {
        'DOWNLOAD_DELAY': 2,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 2,
    }
    handle_httpstatus_list = [404]
    def start_requests(self):
        start = 0
        nopgs = 1
        pages = []
        for i in range(0, nopgs):
            i = i * 120 + start
            pages.append('https://vancouver.craigslist.ca/search/apa?s=' + str(i))
        for page in pages:
            yield scrapy.Request(url=page, callback=self.parse)
    def parse(self, response):
        for li in response.css('ul.rows li p span.result-meta'):
            prc = li.css('span.result-price::text').extract_first()
            sqf = li.css('span.housing::text').extract_first()
            loc = li.css('span.result-hood::text').extract_first()
            objct = { 'prc': prc }
            if sqf:
                objct['sqf'] = sqf
            if loc:
                objct['loc'] = loc
            yield objct

The problem is that the following code returns a list of strings rather than DOM fragments that I can parse further with XPath.

def parse(self, response):
    dad_path = '//ul[@class="rows"]/li/p/span[@class="result-meta"]'
    dad_resp = response.xpath(dad_path).extract()
    ...

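The code above produces output in the following format, where the value in each dict element is one huge string rather than a DOM fragment that can be parsed further.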

[
  {
    "a": "<span class=\"result-meta\">\n                <span class=\"result-price\">$3750</span>\n\n                <span class=\"housing\">\n                    4br -\n                    2150ft<sup>2</sup> -\n                </span>\n\n                <span class=\"result-hood\"> (DUNDARAVE)</span>\n\n                <span class=\"result-tags\">\n                    pic\n                    <span class=\"maptag\" data-pid=\"6611144029\">map</span>\n                </span>\n\n                <span class=\"banish icon icon-trash\" role=\"button\">\n                    <span class=\"screen-reader-text\">hide this posting</span>\n                </span>\n\n            <span class=\"unbanish icon icon-trash red\" role=\"button\" aria-hidden=\"true\"></span>\n            <a href=\"#\" class=\"restore-link\">\n                <span class=\"restore-narrow-text\">restore</span>\n                <span class=\"restore-wide-text\">restore this posting</span>\n            </a>\n\n        </span>"
  }
  ...

So the following code can be said to "malfunction".

for span in dad_resp:
    prc_path = '//span[@class="result-meta"]/span[@class="result-price"]/text()'
    prc_resp = response.xpath(prc_path).extract_first()
    yield {
        'prc': prc_resp
    }

It produces output like this.

[
  {
    "prc": "$3750"
  },
  {
    "prc": "$3750"
  },
  {
    "prc": "$3750"
  },
  ...

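However, if you change response.xpath(prc_path).extract_first() to response.xpath(prc_path).extract(), it does extract all the different prices instead of repeating the same price over and over. But it lumps all the prices into a single list. What I want to do is split each listing's price, square footage, and so on into separate dicts.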

Any idea how to do this? What am I doing wrong?!

0 answers:

No answers