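I'm very new to coding and am trying to build a web scraper. I'm using a Lua script with Splash so that my request waits for a page element to appear before returning. I don't care which element it is; I just need the initial page loader to finish so that I can access the HTML that the JavaScript renders. The specific site I'm trying to access is https://www.ladbrokes.com.au/sports/basketball/usa/nba, which shows a JS initial-loader page before any element on the site loads.
My current code is this: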
import scrapy
from scrapy_splash import SplashRequest


class Ladbrokes(scrapy.Spider):
    name = 'Ladbrokes'
    allowed_domains = ['ladbrokes.com.au']
    start_urls = ['https://www.ladbrokes.com.au/sports']

    def parse(self, response):
        # select_ladbrokes is a helper defined elsewhere (not shown here)
        sports_link = select_ladbrokes(response)
        for link in sports_link:
            url = response.urljoin(link)
            # Render each sport page through Splash using the Lua script
            yield SplashRequest(url=url, callback=self.ladbrokes_all_comps,
                                endpoint='execute',
                                args={'lua_source': lua_script})

    def ladbrokes_all_comps(self, response):
        comps = response.xpath('//*[@id="accordion_4e099d27-0f11-4c6e-848e-965fff7ad995"]/div[2]/div[2]/div[1]/div[2]/div[1]/div/div[1]/text()').extract()

lua_script = '''
function main(splash)
    assert(splash:go(splash.args.url))
    while not splash:select('#page-content-left > div > div') do
        splash:wait(0.1)
    end
    return {html=splash:html()}
end
'''
When I run my spider, I end up with the following error:
2019-11-25 16:41:30 [scrapy.core.engine] DEBUG: Crawled (504) <GET https://www.ladbrokes.com.au/sports/nrl via http://0.0.0.0:8050/execute> (referer: None)
2019-11-25 16:41:30 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <504 https://www.ladbrokes.com.au/sports/nrl>: HTTP status code is not handled or not allowed
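It looks like the request is timing out in the Lua script's while loop, but I'm not sure whether that's because I'm selecting the page element incorrectly.
I've also tried passing a longer wait argument to SplashRequest, but the initial page loader never seems to finish loading. Any help would be great!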
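
For reference, here is a sketch of a bounded version of the wait loop that I could try instead. As far as I understand, Splash returns a 504 when a script runs past the render timeout (30 seconds by default), and my loop above never gives up if the selector never matches. The names `build_splash_args`, `css`, `max_wait` and `found` are my own and purely illustrative, and the default values are guesses rather than tested settings.

```python
# Lua script with an upper bound on polling, so Splash can return a result
# (with found = false) instead of running until the render timeout.
BOUNDED_WAIT_LUA = """
function main(splash, args)
    assert(splash:go(args.url))
    local waited = 0
    -- Poll for the element, but stop once max_wait seconds have passed.
    while not splash:select(args.css) and waited < args.max_wait do
        splash:wait(0.1)
        waited = waited + 0.1
    end
    return {
        html = splash:html(),
        found = splash:select(args.css) ~= nil,
    }
end
"""


def build_splash_args(css, max_wait=20.0, timeout=60.0):
    """Build the args dict for a SplashRequest using the script above.

    css      -- CSS selector to wait for (hypothetical parameter name)
    max_wait -- seconds the Lua loop polls before giving up
    timeout  -- Splash render timeout in seconds; must exceed max_wait
    """
    if max_wait >= timeout:
        raise ValueError("max_wait must be smaller than the Splash timeout")
    return {
        "lua_source": BOUNDED_WAIT_LUA,
        "css": css,
        "max_wait": max_wait,
        "timeout": timeout,
    }
```

The dict would be passed as `args=build_splash_args('#page-content-left > div > div')` in the `SplashRequest` call. If the result comes back with `found` set to false, that would at least tell me the selector never matched, rather than leaving me with a bare 504. I believe Splash also caps `timeout` at its configured maximum (90 seconds unless changed), so the value has to stay under that.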