Question

I am using Scrapy and Splash to scrape this page: https://www.athleteshop.nl/shimano-voor-as-108mm-37184

This is what I get in the Scrapy shell with view(response): [image: scrapy shell img]

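I need the barcode highlighted in red. It is generated by JavaScript, as can be seen in the source code with F12 in Chrome. However, although the page displays correctly both in the Scrapy shell and on the Splash localhost, and although the Splash localhost gives me the correct HTML, the barcode I want to select is always None with response.xpath("//table[@class='data-table']//tr[@class='even']/td[@class='data last']/text()").extract_first().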

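The selector is not the problem, since it works on the source code in Chrome. I have been searching the web for an answer for two days, and nobody seems to have run into the same problem. Is it that Splash does not support this? The settings are the classic ones, as follows: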

    SPLASH_URL = 'http://192.168.99.100:8050/'
    DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    }
    SPIDER_MIDDLEWARES = {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    }
    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
    HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

My code is as follows (the parse part follows the link returned by the website's internal search engine; it works fine):

    def parse(self, response):
        try:
            link = response.xpath("//li[@class='item last']/a/@href").extract_first()
            yield SplashRequest(link, self.parse_item, endpoint='render.html', args={'wait': 20})
        except Exception as e:
            print(str(e))


    def parse_item(self, response):
        product = {}
        product['name'] = response.xpath("//div[@class='product-name']/h1/text()").extract_first()
        product['ean'] = response.xpath("//table[@class='data-table']//tr[@class='even']/td[@class='data last']/text()").extract_first()
        product['price'] = response.xpath("//div[@class='product-shop']//p[@class='special-price']/span[@class='price']/text()").extract_first()
        product['image'] = response.xpath("//div[@class='item image-photo']//img[@class='owl-product-image']/@src").extract_first()
        print(product['name'])
        print(product['ean'])
        print(product['image'])

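The prints of the name and the image URL work fine, since those values are not generated by JavaScript. The code is fine, the settings are fine, and the Splash localhost shows me the right content, but my selector returns nothing when the script runs (without raising any error), and it returns nothing in the Scrapy shell either.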

The problem may be that Scrapy Splash renders immediately without honoring the wait time (20 seconds!). What am I doing wrong, please?
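One way to check the wait-time hypothesis outside Scrapy is to ask Splash for the page directly through its `render.html` HTTP endpoint and inspect the returned HTML in a browser. The sketch below only builds that URL from the `SPLASH_URL` in the settings above; the helper name `build_render_url` is hypothetical, and it assumes the Splash instance accepts the standard `url` and `wait` query parameters.

```python
from urllib.parse import urlencode, urljoin

# Taken from the settings in the question.
SPLASH_URL = "http://192.168.99.100:8050/"


def build_render_url(page_url, wait=20, splash_url=SPLASH_URL):
    """Return a render.html URL asking Splash to render page_url.

    Hypothetical helper: it performs no network request, it only
    assembles the query string for manual inspection.
    """
    query = urlencode({"url": page_url, "wait": wait})
    return urljoin(splash_url, "render.html") + "?" + query


if __name__ == "__main__":
    print(build_render_url("https://www.athleteshop.nl/shimano-voor-as-108mm-37184"))
```

If the HTML returned for that URL already contains the barcode, the wait time is probably not the culprit and the selector deserves a second look.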

Thanks in advance.

Answer 1

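It seems to me that the content of the barcode field is not generated dynamically: I can see it in the page source and extract it from the scrapy shell with response.css('.data-table tbody tr:nth-child(2) td:nth-child(2)::text').extract_first().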
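To illustrate what that positional selector picks (second row, second cell of the table body), here is a minimal sketch using only the standard library. The markup in `SAMPLE_HTML` is hypothetical and the EAN value is a placeholder; the real page's table may be structured differently, and the function name `second_row_second_cell` is made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; not copied from the real page.
SAMPLE_HTML = """
<div>
  <table class="data-table">
    <tbody>
      <tr class="odd"><td class="label">Brand</td><td class="data last">Shimano</td></tr>
      <tr class="even"><td class="label">EAN</td><td class="data last">0000000000000</td></tr>
    </tbody>
  </table>
</div>
"""


def second_row_second_cell(html):
    """Mimic '.data-table tbody tr:nth-child(2) td:nth-child(2)::text'."""
    root = ET.fromstring(html)
    tbody = root.find(".//table[@class='data-table']/tbody")
    # Guard against a missing table or too few rows/cells.
    if tbody is None or len(tbody) < 2 or len(tbody[1]) < 2:
        return None
    # Index 1 is the second child, matching nth-child(2).
    return tbody[1][1].text


if __name__ == "__main__":
    print(second_row_second_cell(SAMPLE_HTML))
```

Selecting by position avoids depending on the exact `class` strings, which is one plausible reason the positional selector could succeed where a class-based XPath returns None.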

Scrapy Splash does not respect the rendering "wait" time
