Running a JavaScript snippet in a specific parse callback in Scrapy

Date: 2017-07-06 16:37:14

Tags: javascript python scrapy splash

I recently started using Scrapy, and while crawling a site I ran into a problem with the displayed currency. This is the part of the page I need to click to change the current currency: Change currency

I can also do it by running the following JavaScript snippet: `setCurrency(1)`. I would have to execute it on every page to make sure each product uses the same currency. From what I have read, I could use Splash with Scrapy for this, but (correct me if I'm wrong) that code would only run once, in the spider's constructor. Is there a way to run it only in the parse callback that extracts the products? My code currently looks like this:

def parse(self, response):
    rutas = response.xpath("//a[@class='nonblock nontext rounded-corners rgba-background clip_frame grpelem']/@href").extract()
    for ruta in rutas:
        ruta_abs = response.urljoin(ruta)
        yield scrapy.Request(url=ruta_abs, callback=self.parse_producto)

def parse_producto(self, response):
    #A script must be run first to change the currency
    nombre = response.xpath("//h1/text()").extract_first()
    #Also known as "Referencia" on the page:
    codigo = response.xpath("//p[@id='product_reference']/span/text()").extract_first()
    descripcion = response.xpath("//div[@id='short_description_block']/div/p/text()").extract_first()
    url_foto = response.xpath("//div[@id='image-block']/span/img/@src").extract_first()
    precio = '.'.join(response.xpath("//span[@id='our_price_display']/text()").re(r"\d+"))
    categorias = response.xpath("//span[@class='navigation_page']/span/a/span/text()").extract()
    categoria_actual = ''
    num_categorias = len(categorias)
    if num_categorias > 1:
        num_categorias = (num_categorias-1)*-1
        categoria_actual = categorias[num_categorias]
    url_producto = response.url
    caract = response.xpath("//section[@class='page-product-box']/table[@class='table-data-sheet']/tr/td/text()").extract()
    ficha_tecnica = []
    if len(caract) > 1:
        ficha_tecnica = list(zip(caract[0::2],caract[1::2]))
    #Build the product item:
    producto = Producto_tienda()
    producto['nombre'] = nombre
    producto['descripcion'] = descripcion
    producto['url_foto'] = url_foto
    producto['precio'] = precio
    producto['id_tienda'] = 2
    producto['tienda'] = 'ARTEC'
    producto['url_producto'] = url_producto
    producto['codigo'] = codigo
    producto['categoria'] = categoria_actual
    producto['ficha_tecnica'] = ficha_tecnica
    yield producto
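The two trickier extraction lines above (the price regex and the zip pairing for `ficha_tecnica`) can be checked in isolation with plain Python; the sample strings below are made up for illustration and are not taken from the real page:

```python
import re

# Price: pull the digit runs out of the displayed price text and rejoin
# them with a dot, so "19,99 €" and "19.99 €" normalize the same way.
texto_precio = "19,99 €"  # hypothetical sample
precio = ".".join(re.findall(r"\d+", texto_precio))
# → "19.99"

# Technical sheet: the table cells arrive as one flat list of alternating
# labels and values; zipping the even and odd slices pairs them up.
caract = ["Peso", "2 kg", "Color", "Rojo"]  # hypothetical sample cells
ficha_tecnica = list(zip(caract[0::2], caract[1::2]))
# → [("Peso", "2 kg"), ("Color", "Rojo")]
```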

I have omitted the beginning of the code, where the name, allowed_domains and start_urls are set, to avoid legal issues.

I need to execute the JavaScript code every time the spider reaches the parse_producto function. Is there a way to do that? I can provide more information if you need it.
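For what it's worth, Splash does not have to run in the constructor: with the scrapy-splash plugin, each request can carry its own JavaScript via the `js_source` argument, so the script runs only for the requests it is attached to. A minimal sketch of the idea, assuming scrapy-splash is installed and a Splash instance is running (only the args dict below executes here; the `SplashRequest` line is shown as a comment):

```python
# Per-request Splash arguments: run setCurrency(1) on the page before
# rendering, then give the page a moment to update.
splash_args = {"js_source": "setCurrency(1);", "wait": 0.5}

# In parse() this would replace the plain scrapy.Request (hypothetical,
# requires scrapy-splash):
# yield SplashRequest(ruta_abs, callback=self.parse_producto, args=splash_args)
```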

Thanks in advance!

1 Answer:

Answer 0 (score: 0)

You can use Selenium + PhantomJS in a DownloaderMiddleware. Below is a snippet from a quick crawler I wrote for scraping Google. You can do everything you need (such as clicking the currency button) before returning the HtmlResponse built from the driver's page_source.

# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from scrapy.http import HtmlResponse
from scrapy.exceptions import IgnoreRequest


class SeleniumMiddleware(object):
    def __init__(self):
        self.driver = webdriver.PhantomJS()

    def process_request(self, request, spider):
        self.driver.get(request.url)
        try:
            # Wait up to 5 seconds for the content to appear
            # (the "r" class here is specific to Google's result pages)
            WebDriverWait(self.driver, 5).until(
                EC.presence_of_element_located((By.CLASS_NAME, "r")))
            body = self.driver.page_source
            return HtmlResponse(self.driver.current_url, body=body,
                                encoding='utf-8', request=request)
        except Exception:
            # Timeout on WebDriverWait: drop the request
            raise IgnoreRequest

You also need to register the DownloaderMiddleware in settings.py, for example:

DOWNLOADER_MIDDLEWARES = {
    'selenium_downloader_middleware.middlewares.SeleniumMiddleware': 723,
}
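To run the browser only for the product pages (rather than for every request the spider makes), a common pattern is to gate the middleware on a `request.meta` flag; the flag name below is an assumption for illustration, not something built into Scrapy:

```python
# Minimal sketch: the middleware falls through to Scrapy's normal
# downloader unless the spider explicitly asked for the browser.
def should_use_browser(meta):
    return bool(meta.get("use_selenium"))

# In process_request() the first lines would then be:
#   if not should_use_browser(request.meta):
#       return None  # let Scrapy download this request normally
#
# and the spider flags only the product requests:
#   yield scrapy.Request(ruta_abs, callback=self.parse_producto,
#                        meta={"use_selenium": True})
```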