Extract information from a div and set another field as the parent

Posted: 2015-07-13 00:55:29

Tags: python json web-scraping scrapy

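I want to extract information from a fairly simple piece of markup, but some whitespace and <br> tags are making my JSON file come out wrong.

This is the main div:

main_div

Its markup is as follows: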

<div class="caixanorm">
   <div id="titulo">
      <a href="http://quonde.com.br/club-4/" rel="bookmark" title="Link para CLUB 4">
         <h2>CLUB 4</h2>
         <h3 id="subtitulo">Academia                             </h3>
      </a>
   </div>
   <div id="endereco">
      (61) 3346-7423<br>
      CRS 515, entrada W2                
   </div>
   <div id="servecat">
      Em <a href="http://quonde.com.br/asasul/esporte/academias/" rel="category tag">Academias</a> da  <a href="http://quonde.com.br/quadras/516-515/" rel="tag">516 / 515</a> Sul
   </div>
</div>

Here is my code:

- items.py

import scrapy

class QuondeItem(scrapy.Item):
    localizacao = scrapy.Field()  #location
    titulo = scrapy.Field()       #title
    subtitulo = scrapy.Field()    #subtitle
    telefone = scrapy.Field()     #phone
    endereco = scrapy.Field()     #address
    categoria = scrapy.Field()    #category
    quadra = scrapy.Field()       #block

- my_spider.py

import scrapy
from quonde.items import QuondeItem


class MySpider(scrapy.Spider):
    name = "quonde"
    allowed_domains = ["quonde.com.br"]
    start_urls = [
        "http://quonde.com.br/quadras/516-515/",

    ]

    def parse(self, response):
        div = response.xpath('//div[@class="caixanorm"]')
        items = []
        for sel in div:
            item = QuondeItem()
            item['localizacao'] = sel.xpath('//h1[@class="inline"]/span/text()').extract()
            item['titulo'] = sel.xpath('//div[@id="titulo"]/a/h2/text()').extract()
            item['subtitulo'] = sel.xpath('//div[@id="titulo"]/a/h3/text()').extract()
            item['telefone'] = sel.xpath('//div[@id="endereco"]/text()[1]').extract()
            item['endereco'] = sel.xpath('//div[@id="endereco"]/text()[2]').extract()
            item['categoria'] = sel.xpath('//div[@id="servecat"]/a[1]/text()').extract()
            item['quadra'] = sel.xpath('//div[@id="servecat"]/a[@rel="tag"]/text()').extract()
            items.append(item)
            return items

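As we can see, the first field in items.py is not described in the div, because I want it to be the parent and the rest to be its children... However, this is what I get: JSON Result. The phone and address come with HTML characters and whitespace, and I can't make the location the parent of all the other fields in each block (explanation).

Besides that, I'd like to know whether the JSON itself is formed correctly. For example, title 0 corresponds to subtitle 0, but shouldn't each pair sit together in one record instead of being repeated across records?

Sorry for my English, and thanks!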

1 Answer:

Answer 0: (score: 1)

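The key problem here is that your XPath expressions are not relative to the current selector: you need to start each expression with a dot (for example, .//div instead of //div).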

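Also, you don't need to extract the location inside the loop; do it once beforehand.

Finally, to clean up the extracted fields, use an Item Loader with input and output processors: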

import scrapy
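from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst, MapCompose
# Older Scrapy releases exposed these as scrapy.contrib.loader
# and scrapy.contrib.loader.processor instead.


class QuondeItem(scrapy.Item):
    localizacao = scrapy.Field()  #location
    titulo = scrapy.Field()       #title
    subtitulo = scrapy.Field()    #subtitle
    telefone = scrapy.Field()     #phone
    endereco = scrapy.Field()     #address
    categoria = scrapy.Field()    #category
    quadra = scrapy.Field()       #block


class QuondeItemLoader(ItemLoader):
    # Strip surrounding whitespace from every extracted string
    # (use unicode.strip instead of str.strip on Python 2).
    default_input_processor = MapCompose(str.strip)
    # Return the first non-empty value instead of a list.
    default_output_processor = TakeFirst()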
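The effect of the two processors can be sketched in plain Python. The helpers `strip_all` and `take_first` below are hypothetical, simplified stand-ins for `MapCompose(str.strip)` and `TakeFirst()`, not Scrapy's actual implementations, and the raw strings are only an approximation of what the `text()` nodes around the <br> tag might look like.

```python
def strip_all(values):
    """Simplified stand-in for MapCompose(str.strip): strip each extracted string."""
    return [value.strip() for value in values]


def take_first(values):
    """Simplified stand-in for TakeFirst(): first value that is neither None nor ''."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


# Assumed raw text nodes from <div id="endereco">, split by the <br> tag.
raw = ["\n      (61) 3346-7423", "\n      CRS 515, entrada W2                \n   "]

cleaned = strip_all(raw)
telefone = take_first(cleaned[:1])   # corresponds to text()[1]
endereco = take_first(cleaned[1:])   # corresponds to text()[2]

print(repr(telefone))
print(repr(endereco))
```

The input processor runs on every extracted value as it is added, and the output processor collapses the collected list into a single value when the item is loaded, so each field ends up as a clean string rather than a list of padded strings.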

The modified spider code:

import scrapy
from quonde.items import QuondeItem, QuondeItemLoader


class MySpider(scrapy.Spider):
    name = "quonde"
    allowed_domains = ["quonde.com.br"]
    start_urls = [
        "http://quonde.com.br/quadras/516-515/",
    ]

    def parse(self, response):
        div = response.xpath('//div[@class="caixanorm"]')
        location = response.xpath('.//h1[@class="inline"]/span/text()').extract()[0]
        for sel in div:
            loader = QuondeItemLoader(QuondeItem(), selector=sel)

            loader.add_value("localizacao", location)
            loader.add_xpath("titulo", './/div[@id="titulo"]/a/h2/text()')
            loader.add_xpath("subtitulo", './/div[@id="titulo"]/a/h3/text()')
            loader.add_xpath("telefone", './/div[@id="endereco"]/text()[1]')
            loader.add_xpath("endereco", './/div[@id="endereco"]/text()[2]')
            loader.add_xpath("categoria", './/div[@id="servecat"]/a[1]/text()')
            loader.add_xpath("quadra", './/div[@id="servecat"]/a[@rel="tag"]/text()')

            yield loader.load_item()

Here is the resulting JSON output:

[{"subtitulo": "Laborat\u00f3rio", "categoria": "Cl\u00ednicas e Consult\u00f3rios", "quadra": "516 / 515", "telefone": "(61) 3245-1275", "endereco": "CRS 515, Bl. B, Loja 77", "titulo": "Micra", "localizacao": "516 / 515"},
{"subtitulo": "Pneus e Rodas", "categoria": "Autom\u00f3veis", "quadra": "516 / 515", "telefone": "(61) 3346-1666", "endereco": "CRS 515, Bl. B, Loja 14", "titulo": "Impacto", "localizacao": "516 / 515"},
...
{"subtitulo": "Cons\u00f3rcios", "categoria": "Consultorias e Assessorias", "quadra": "516 / 515", "telefone": "(61) 3346-8073", "endereco": "SHCS 516, Bl. C, Lj. 75", "titulo": "FERRAZ", "localizacao": "516 / 515"},
{"subtitulo": "Tape\u00e7aria", "categoria": "Decora\u00e7\u00f5es e Molduras", "quadra": "516 / 515", "telefone": "(61) 3245-3888", "endereco": "SHCS 516, Bl. C, Lj. 56", "titulo": "MUNDO DOS TAPETES", "localizacao": "516 / 515"}]
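The output above is flat: the location is repeated in every record. If a nested structure with the location as the parent is preferred, one possible approach is to regroup the records after the crawl. This is only a sketch; the function `group_by_location` and the `"children"` key are hypothetical names chosen for illustration, and the two sample records are copied from the output above.

```python
import json
from collections import defaultdict


def group_by_location(items, parent_key="localizacao"):
    """Nest flat records under their shared parent value."""
    grouped = defaultdict(list)
    for item in items:
        # Copy every field except the parent key into the child record.
        child = {key: value for key, value in item.items() if key != parent_key}
        grouped[item[parent_key]].append(child)
    # "children" is a hypothetical key name; pick whatever suits the consumer.
    return [
        {parent_key: location, "children": children}
        for location, children in grouped.items()
    ]


flat = [
    {"subtitulo": "Laborat\u00f3rio", "categoria": "Cl\u00ednicas e Consult\u00f3rios",
     "quadra": "516 / 515", "telefone": "(61) 3245-1275",
     "endereco": "CRS 515, Bl. B, Loja 77", "titulo": "Micra",
     "localizacao": "516 / 515"},
    {"subtitulo": "Pneus e Rodas", "categoria": "Autom\u00f3veis",
     "quadra": "516 / 515", "telefone": "(61) 3346-1666",
     "endereco": "CRS 515, Bl. B, Loja 14", "titulo": "Impacto",
     "localizacao": "516 / 515"},
]

nested = group_by_location(flat)
print(json.dumps(nested, ensure_ascii=False, indent=2))
```

Keeping the spider output flat and regrouping afterwards keeps each yielded item self-contained, which tends to be simpler to export and filter than building the nesting inside the spider.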