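I'm using the Scrapy framework and I can't load the additional data from the page I'm scraping; it has a "load more" link that reveals further results. Could you tell me what I can do? Thank you.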
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from prueba1.items import Prueba1Item
from scrapy.exceptions import CloseSpider


class PruebaSpider(CrawlSpider):
    name = 'prueba1'
    item_count = 0
    # Domain names only, without scheme or path
    allowed_domains = ['abc.com.py']
    start_urls = [
        'http://www.abc.com.py/buscar/?buscar=Santiago+Pe%C3%B1a',
        'http://www.abc.com.py/buscar/?buscar=Santi+Pe%C3%B1a',
        'http://www.abc.com.py/buscar/?buscar=santiago+pe%C3%B1a',
        'http://www.abc.com.py/buscar/?buscar=santi+pe%C3%B1a',
    ]
    rules = (
        # Follow the "load more" link
        Rule(LinkExtractor(allow=(), canonicalize=True, unique=True,
                           restrict_xpaths='//html/body/div/a[@id="load-more"]')),
        # Follow article links and parse each article page
        Rule(LinkExtractor(allow=(), canonicalize=True, unique=True,
                           restrict_xpaths='//div[@class="article"]'),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        ml_item = Prueba1Item()
        ml_item['article'] = response.xpath('normalize-space(//h1)').extract()
        ml_item['fecha'] = response.xpath('normalize-space(//small)').extract()
        ml_item['contenido'] = response.xpath(
            'normalize-space(//p[@class="summary"])').extract()
        ml_item['contenido2'] = response.xpath(
            'normalize-space(//div[@class="text"])').extract()
        ml_item['url'] = response.xpath('normalize-space(//link/@href)').extract()
        ml_item['comentarioFacebook'] = response.xpath(
            'normalize-space(//div[@class="_30o4"]/span/span[@class="_5mdd"])'
        ).extract()
        # Stop the crawl once 50 items have been yielded
        self.item_count += 1
        if self.item_count > 50:
            raise CloseSpider('item_exceeded')
        yield ml_item
The search returns more than 4,000 results, but I can't retrieve more than 50 with this code.
Answer 0 (score: 0)
The content is loaded dynamically as JSON in the following format:
{
"titulo": "Tuma se suma a Marito",
"copete": "El diputado \u00d3scar Tuma oficializ\u00f3 su respaldo a la candidatura de Mario Abdo Ben\u00edtez a la presidencia, ya que sus reportes indicaron que el candidato de Colorado A\u00f1etet\u00e9 tiene mayor intenci\u00f3n de votos. El 100% de su dirigencia se lo pidi\u00f3, dice. ",
"publicacion": "09-11-2017 08:00",
"imagen": "2017\/10\/02\/el-diputado-scar-tuma-inscribio-ayer-las-precandidaturas-de-su-movimiento-tu-asuncion-en-la-junta-de-gobierno--200750000000-1634915.jpg",
"url": "nacionales\/tuma-se-suma-a-marito-1648202.html",
"autor": "",
"hits": "3163",
"comentarios": "1",
"corte_url": "https:\/\/s3-sa-east-1.amazonaws.com\/assets.abc.com.py\/2017\/10\/02\/_146_162_1542245.jpg",
"corte_width": 146,
"corte_height": 162,
"autor_nombre": null,
"autor_url": "http:\/\/www.abc.com.py\/autor\/-.html",
"total": "4861"
}
So you can fetch the data directly as JSON from the following URL, without using XPath or CSS selectors (try opening it in your browser): http://www.abc.com.py/ajax.php?seccion=busqueda-avanzada&tipo=4&tipoplant=0&buscar=Santiago+Pe%C3%B1a&desde=&hasta=&seccion-noticia=&temas=&begin=0&limit=7&aditional=
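As a rough illustration of what "no XPath or CSS selectors" means in practice, here is a minimal sketch that decodes one record shaped like the sample above and maps it onto some of the fields the question's spider collects. The helper name `record_to_item` is made up for this example, and I'm assuming (not having verified it) that the `url` field is relative to the site root.

```python
import json
from urllib.parse import urljoin

SITE = "http://www.abc.com.py/"


def record_to_item(record):
    """Map one decoded JSON record to a plain dict of item fields.

    Hypothetical helper; the key names come from the sample record.
    """
    return {
        "article": record["titulo"],
        "fecha": record["publicacion"],
        # Assumption: "url" is a path relative to the site root.
        "url": urljoin(SITE, record["url"]),
    }


# A trimmed copy of the sample record, kept as raw JSON text.
raw = (r'{"titulo": "Tuma se suma a Marito", '
       r'"publicacion": "09-11-2017 08:00", '
       r'"url": "nacionales\/tuma-se-suma-a-marito-1648202.html", '
       r'"total": "4861"}')

record = json.loads(raw)          # "\/" decodes to a plain "/"
item = record_to_item(record)
total = int(record["total"])      # "total" arrives as a string

print(item["url"])
print(total)
```

Inside a Scrapy callback the same decoding would start from `json.loads(response.text)` instead of the hard-coded string.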
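I don't think it would be hard to modify the URL to fetch the data you need and put it in your script. For example, to get 10 items starting from the second one, just change `begin` to 2 and `limit` to 10.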
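To make that concrete, here is a small sketch that builds the endpoint URL for any `begin`/`limit` pair and then walks through all results in fixed-size steps. The names `page_url` and `page_urls` are hypothetical helpers, and treating `begin` as a zero-based item offset (rather than a page number) is my assumption based on the example URL; check it against the actual responses.

```python
from urllib.parse import urlencode

AJAX_URL = "http://www.abc.com.py/ajax.php"

# Query parameters copied from the example URL and left unchanged.
FIXED_PARAMS = {
    "seccion": "busqueda-avanzada",
    "tipo": 4,
    "tipoplant": 0,
    "desde": "",
    "hasta": "",
    "seccion-noticia": "",
    "temas": "",
    "aditional": "",
}


def page_url(query, begin, limit):
    """Return the search URL for one slice of results."""
    params = dict(FIXED_PARAMS, buscar=query, begin=begin, limit=limit)
    # urlencode percent-encodes the query and turns spaces into "+".
    return AJAX_URL + "?" + urlencode(params)


def page_urls(query, total, limit=10, begin=0):
    """Yield one URL per slice until `total` items are covered.

    Assumes `begin` is an item offset, which is not confirmed.
    """
    for offset in range(begin, total, limit):
        yield page_url(query, offset, limit)


print(page_url("Santiago Peña", begin=2, limit=10))
```

In a spider, each of these URLs could be turned into a `scrapy.Request` from `start_requests`, with `total` taken from the `total` field of the first response.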