Because it is referenced in the xpath to another view tag

Time: 2017-11-13 05:41:07

Tags: xpath scrapy

I am using the Scrapy framework and I cannot load the additional data from the page I am parsing; it has a "see more" button. Can you tell me what can be done? Thank you.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from prueba1.items import Prueba1Item
from scrapy.exceptions import CloseSpider


class PruebaSpider(CrawlSpider):
    name = 'prueba1'
    item_count = 0
    allowed_domains = ['abc.com.py']
    start_urls = [
        'http://www.abc.com.py/buscar/?buscar=Santiago+Pe%C3%B1a',
        'http://www.abc.com.py/buscar/?buscar=Santi+Pe%C3%B1a',
        'http://www.abc.com.py/buscar/?buscar=santiago+pe%C3%B1a',
        'http://www.abc.com.py/buscar/?buscar=santi+pe%C3%B1a',
    ]

    rules = (
        # Follow the "load more" link on the search results page.
        Rule(LinkExtractor(allow=(), canonicalize=True, unique=True,
                           restrict_xpaths=('//html/body/div/a[@id="load-more"]',))),
        # Follow each article on the results page and parse it.
        Rule(LinkExtractor(allow=(), canonicalize=True, unique=True,
                           restrict_xpaths=('//div[@class="article"]',)),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        ml_item = Prueba1Item()

        ml_item['article'] = response.xpath('normalize-space(//h1)').extract()
        ml_item['fecha'] = response.xpath('normalize-space(//small)').extract()
        ml_item['contenido'] = response.xpath('normalize-space(//p[@class="summary"])').extract()
        ml_item['contenido2'] = response.xpath('normalize-space(//div[@class="text"])').extract()
        ml_item['url'] = response.xpath('normalize-space(//link/@href)').extract()
        ml_item['comentarioFacebook'] = response.xpath(
            'normalize-space(//div[@class="_30o4"]/span/span[@class="_5mdd"])').extract()

        # Stop the crawl after 50 items.
        self.item_count += 1
        if self.item_count > 50:
            raise CloseSpider('item_exceeded')
        yield ml_item

According to the search results there are more than 4000 hits, but with this code I cannot retrieve more than 50.


1 Answer:

Answer 0 (score: 0)

The content is loaded dynamically as JSON in the following format:

   {
      "titulo": "Tuma se suma a Marito",
      "copete": "El diputado \u00d3scar Tuma oficializ\u00f3 su respaldo a la candidatura de Mario Abdo Ben\u00edtez a la presidencia, ya que sus reportes indicaron que el candidato de Colorado A\u00f1etet\u00e9 tiene mayor intenci\u00f3n de votos. El 100% de su dirigencia se lo pidi\u00f3, dice. ",
      "publicacion": "09-11-2017 08:00",
      "imagen": "2017\/10\/02\/el-diputado-scar-tuma-inscribio-ayer-las-precandidaturas-de-su-movimiento-tu-asuncion-en-la-junta-de-gobierno--200750000000-1634915.jpg",
      "url": "nacionales\/tuma-se-suma-a-marito-1648202.html",
      "autor": "",
      "hits": "3163",
      "comentarios": "1",
      "corte_url": "https:\/\/s3-sa-east-1.amazonaws.com\/assets.abc.com.py\/2017\/10\/02\/_146_162_1542245.jpg",
      "corte_width": 146,
      "corte_height": 162,
      "autor_nombre": null,
      "autor_url": "http:\/\/www.abc.com.py\/autor\/-.html",
      "total": "4861"
    }

So you can get the data in JSON format directly from the following URL, without using XPath or CSS selectors (try opening this URL in your browser): http://www.abc.com.py/ajax.php?seccion=busqueda-avanzada&tipo=4&tipoplant=0&buscar=Santiago+Pe%C3%B1a&desde=&hasta=&seccion-noticia=&temas=&begin=0&limit=7&aditional=

I don't think it would be hard to modify the URL to get the data you need and plug it into your script. For example, to fetch 10 items starting from the second one, just change begin to 2 and limit to 10:

http://www.abc.com.py/ajax.php?seccion=busqueda-avanzada&tipo=4&tipoplant=0&buscar=Santiago+Pe%C3%B1a&desde=&hasta=&seccion-noticia=&temas=&begin=2&limit=10&aditional=
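A minimal sketch of how that could look as a plain Scrapy spider that reads the JSON endpoint directly and pages through it by incrementing begin. The field names (titulo, copete, publicacion, url, total) come from the sample above; the assumptions that begin is a result offset and that the endpoint returns a JSON array of such objects are mine and should be verified in the browser first.

import json
import scrapy


class PruebaJsonSpider(scrapy.Spider):
    name = 'prueba1_json'
    allowed_domains = ['abc.com.py']

    # Search endpoint with begin/limit placeholders (assumed to behave as
    # offset / page size; check the URL in a browser to confirm).
    base_url = (
        'http://www.abc.com.py/ajax.php?seccion=busqueda-avanzada'
        '&tipo=4&tipoplant=0&buscar=Santiago+Pe%C3%B1a&desde=&hasta='
        '&seccion-noticia=&temas=&begin={begin}&limit={limit}&aditional='
    )
    page_size = 50

    def start_requests(self):
        yield scrapy.Request(
            self.base_url.format(begin=0, limit=self.page_size),
            callback=self.parse,
            meta={'begin': 0},
        )

    def parse(self, response):
        begin = response.meta['begin']
        results = json.loads(response.text)  # assumed to be a list of result objects

        for entry in results:
            # Field names taken from the JSON sample shown above.
            yield {
                'article': entry.get('titulo'),
                'fecha': entry.get('publicacion'),
                'contenido': entry.get('copete'),
                'url': response.urljoin('/' + entry.get('url', '')),
            }

        # "total" reports the overall number of hits; keep paginating until
        # the next offset would go past it.
        total = int(results[0]['total']) if results else 0
        next_begin = begin + self.page_size
        if next_begin < total:
            yield scrapy.Request(
                self.base_url.format(begin=next_begin, limit=self.page_size),
                callback=self.parse,
                meta={'begin': next_begin},
            )

Because the data comes straight from the JSON endpoint, there is no need for the CrawlSpider rules or the "load more" link at all, and the items here are plain dicts rather than Prueba1Item; they can be mapped back to the item class if needed.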