Question

I'm trying to get content from an iframe, so I changed my Splash request endpoint from execute to render.json. However, splash:wait doesn't work at all. Here is the spider code.

import scrapy
from scrapy_splash import SplashRequest
from scrapy.http import HtmlResponse
src="""
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(10))
  return {
    html = splash:html()
  }
end

"""

class Lafarge(scrapy.Spider):
    name = "lafargespider"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.root_url = "https://cacareers-lafarge-na.icims.com/jobs/search?pr=0&searchRelation=keyword_all&schemaId=&o="

    def start_requests(self):
        yield SplashRequest(
            self.root_url,
            self.parse_detail,
            endpoint='render.json',
            args={
                'iframes': 1,
                'html': 1,
                'lua_source': src,
                'timeout': 90,
            },
        )

    def parse_detail(self, response):
        # response.data is the decoded JSON returned by render.json
        rs = response.data['childFrames'][0]['html']
        response = HtmlResponse(url="my HTML string", body=rs, encoding='utf-8')
        print("next page ===>", response.xpath('//a[@class="glyph "]/@href').extract_first())

Answer 1

Passing the wait time in the SplashRequest args solved this for me.

def start_requests(self):
    yield SplashRequest(
        self.root_url,
        self.parse_detail,
        endpoint='render.json',
        args={
            'wait': 5,
            'iframes': 1,
            'html': 1,
            'lua_source': src,
        },
    )

def parse_detail(self, response):
    rs = response.data['childFrames'][0]['html']
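
Indexing childFrames[0] directly raises an error if the page rendered without any iframe. Here is a minimal sketch of a more defensive lookup. The helper names iter_frame_html and first_iframe_html are hypothetical, and the sketch assumes each frame entry may carry its own nested childFrames list next to its html key.

```python
def iter_frame_html(frame):
    """Yield the HTML of every child frame, depth-first.

    `frame` is a dict shaped like the decoded render.json output
    (assumption: child frames may nest under their own "childFrames").
    """
    for child in frame.get("childFrames") or []:
        html = child.get("html")
        if html:
            yield html
        # Descend into frames embedded inside this child frame.
        yield from iter_frame_html(child)


def first_iframe_html(data):
    """Return the first child frame's HTML, or None if there is none."""
    return next(iter_frame_html(data), None)


if __name__ == "__main__":
    sample = {
        "html": "<html>top</html>",
        "childFrames": [{"html": "<html>inner</html>", "childFrames": []}],
    }
    print(first_iframe_html(sample))
```

In parse_detail this could replace the direct index with rs = first_iframe_html(response.data), followed by an early return when rs is None.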

Answer 2

Pass the wait parameter in args. It should be:

args = {'wait': 5, 'iframes': 1, 'html': 1, 'lua_source': src, 'timeout': 90}

Answer 3

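lua_source is not supported by the render.json endpoint, only by execute, so the lua_source argument is not needed in this code.

The fix is to use the wait argument instead. See the description of wait on page 11: https://media.readthedocs.org/pdf/splash/latest/splash.pdf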
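
As a rough sketch of that advice, the args for render.json can be built without lua_source at all. The helper name render_json_args is hypothetical, and the check that wait stays below timeout reflects my reading of the Splash documentation, not something tested against this site.

```python
def render_json_args(wait=5.0, timeout=90.0):
    """Build args for Splash's render.json endpoint, with no lua_source.

    `wait` is the number of seconds Splash waits after the page loads;
    the range checks below are assumptions about what Splash accepts.
    """
    if wait < 0:
        raise ValueError("wait must be non-negative")
    if wait >= timeout:
        raise ValueError("wait must be smaller than timeout")
    return {
        "wait": wait,        # replaces splash:wait() from the Lua script
        "iframes": 1,        # include child frame data in the JSON output
        "html": 1,           # include rendered HTML in the JSON output
        "timeout": timeout,
    }


if __name__ == "__main__":
    print(render_json_args())
```

The resulting dict can be passed as args=render_json_args(wait=5) to SplashRequest with endpoint='render.json'.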

Splash does not wait when Scrapy uses endpoint='render.json'
