I recently started using Scrapy, and while crawling I ran into a problem with the currency the site displays. This is the part of the page I have to click to change the active currency:
I can change it, but only by running the following JavaScript: setCurrency(1). I would have to do this on every page to make sure all products use the same currency. From what I have read, I could use Splash with Scrapy for this; however (correct me if I am wrong), that code would only run once, in the spider's constructor. Is there a way to run it only in the parse callback that extracts the products? My code currently looks like this:
def parse(self, response):
    rutas = response.xpath("//a[@class='nonblock nontext rounded-corners rgba-background clip_frame grpelem']/@href").extract()
    for ruta in rutas:
        ruta_abs = response.urljoin(ruta)
        yield scrapy.Request(url=ruta_abs, callback=self.parse_producto)

def parse_producto(self, response):
    # A script to change the currency must be executed before this point
    nombre = response.xpath("//h1/text()").extract_first()
    # Also known as "Referencia" on the page:
    codigo = response.xpath("//p[@id='product_reference']/span/text()").extract_first()
    descripcion = response.xpath("//div[@id='short_description_block']/div/p/text()").extract_first()
    url_foto = response.xpath("//div[@id='image-block']/span/img/@src").extract_first()
    precio = '.'.join(response.xpath("//span[@id='our_price_display']/text()").re(r"\d+"))
    categorias = response.xpath("//span[@class='navigation_page']/span/a/span/text()").extract()
    categoria_actual = ''
    num_categorias = len(categorias)
    if num_categorias > 1:
        num_categorias = (num_categorias - 1) * -1
        categoria_actual = categorias[num_categorias]
    url_producto = response.url
    caract = response.xpath("//section[@class='page-product-box']/table[@class='table-data-sheet']/tr/td/text()").extract()
    ficha_tecnica = []
    if len(caract) > 1:
        ficha_tecnica = list(zip(caract[0::2], caract[1::2]))
    # Build the product item:
    producto = Producto_tienda()
    producto['nombre'] = nombre
    producto['descripcion'] = descripcion
    producto['url_foto'] = url_foto
    producto['precio'] = precio
    producto['id_tienda'] = 2
    producto['tienda'] = 'ARTEC'
    producto['url_producto'] = url_producto
    producto['codigo'] = codigo
    producto['categoria'] = categoria_actual
    producto['ficha_tecnica'] = ficha_tecnica
    yield producto
I have omitted the beginning of the code, where the name, allowed_domains and start_urls are set, to avoid legal issues.
I need to execute the JavaScript every time the spider reaches the parse_producto callback. Is there a way to do this? I can provide more information if needed.
Thanks in advance!
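The price and spec-sheet extraction above relies on two small tricks that can be checked in isolation with plain Python (the sample strings below are made up; the real values come from the XPath extractions):

```python
import re

# Trick 1: the price regex pulls out runs of digits and re-joins them
# with '.', so a European-formatted "123,45" becomes "123.45".
precio_raw = "123,45 €"  # hypothetical text from our_price_display
precio = '.'.join(re.findall(r"\d+", precio_raw))
# precio == "123.45"

# Trick 2: the spec-sheet cells alternate label/value, so slicing with
# steps of 2 and zipping pairs them up.
caract = ["Peso", "2 kg", "Color", "Rojo"]  # hypothetical table cells
ficha_tecnica = list(zip(caract[0::2], caract[1::2]))
# ficha_tecnica == [("Peso", "2 kg"), ("Color", "Rojo")]
```

Note that a price like "1.234,56" would yield "1.234.56" with this approach, which may need extra handling.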
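For reference, the per-request Splash approach I was considering would look roughly like this. This is only a sketch: it assumes scrapy-splash is installed and configured, and setCurrency is the page's own global function:

```python
# Sketch of a Splash Lua script that calls the page's setCurrency()
# before rendering, so every product page is fetched in the same currency.
lua_script = """
function main(splash, args)
    assert(splash:go(args.url))
    assert(splash:wait(1))
    splash:runjs("setCurrency(1)")
    assert(splash:wait(1))
    return splash:html()
end
"""

# Each product request would then go through Splash's 'execute' endpoint:
#   yield SplashRequest(ruta_abs, callback=self.parse_producto,
#                       endpoint='execute', args={'lua_source': lua_script})
splash_args = {'lua_source': lua_script}
```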
Answer 0 (score: 0):
You can use Selenium + PhantomJS with a DownloaderMiddleware. Below is a snippet from a quick crawler I wrote to scrape Google. You can do everything you need to do (such as clicking the currency button) before returning the HtmlResponse built from the driver's page_source.
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from scrapy.http import HtmlResponse
from scrapy.exceptions import IgnoreRequest


class SeleniumMiddleware(object):
    def __init__(self):
        self.driver = webdriver.PhantomJS()

    def process_request(self, request, spider):
        self.driver.get(request.url)
        try:
            WebDriverWait(self.driver, 5).until(
                EC.presence_of_element_located((By.CLASS_NAME, "r"))
            )
            body = self.driver.page_source
            return HtmlResponse(self.driver.current_url, body=body,
                                encoding='utf-8', request=request)
        except Exception:
            # Timeout on WebDriverWait
            raise IgnoreRequest
You also need to register the DownloaderMiddleware in settings.py, for example:
DOWNLOADER_MIDDLEWARES = {
    'selenium_downloader_middleware.middlewares.SeleniumMiddleware': 723,
}
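Since you only want the Selenium round-trip for the product pages, a common pattern is to gate the middleware on a request.meta flag so other requests use Scrapy's normal downloader. A minimal sketch (the "use_selenium" key is a name I made up, and the browser-driving part is omitted):

```python
# Sketch: only route flagged requests through Selenium. You would set
# the hypothetical "use_selenium" meta key when yielding product requests:
#   yield scrapy.Request(ruta_abs, callback=self.parse_producto,
#                        meta={"use_selenium": True})

class GatedSeleniumMiddleware(object):
    def process_request(self, request, spider):
        if not request.meta.get("use_selenium"):
            return None  # returning None lets Scrapy download it normally
        # Otherwise drive the browser and return an HtmlResponse,
        # exactly as in the middleware above (omitted in this sketch).
        ...
```

Returning None from process_request tells Scrapy to continue with the default download chain, which is what makes the gating work.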