It took me a while, but I finally figured out where the difference is!
Using the URL https://www.meetup.com/Google-Cloud_Meetup_Singapore_by_Cloud-Ace/events/264513425/attendees/
with the MeetupGetParticipants spider, the Scrapy shell shows:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x04E0BD30>
[s] item {}
[s] request <GET http://meetup.com/Google-Cloud_Meetup_Singapore_by_Cloud-Ace/events/264513425/attendees/ via http://localhost:8050/render.html>
[s] response <200 http://meetup.com/Google-Cloud_Meetup_Singapore_by_Cloud-Ace/events/264513425/attendees/>
[s] settings <scrapy.settings.Settings object at 0x04E0BC70>
[s] spider <MeetupGetParticipants 'MeetupGetParticipants' at 0x4ff0450>
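Why does Splash return the original URL? Isn't the whole point of Splash to return what render.html renders? What I want is the result of http://localhost:8050/render.html?url=https://www.meetup.com/Google-Cloud_Meetup_Singapore_by_Cloud-Ace/events/264513425/attendees/ (which gives me the rendered web page).
Basically, I could hack the URL myself to make it work, but there is something here I don't understand.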
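For reference, the "hack the URL myself" approach can be sketched as below. This is a minimal sketch, not the scrapy-splash way of doing it: `splash_render_url` is a hypothetical helper, the Splash address is the one from the question, and the `wait` value is an assumption about how long the page needs. As far as I know, scrapy-splash deliberately reports the original page URL in `response.url` and keeps the Splash endpoint URL in `response.real_url`, which would explain the shell output above.

```python
from urllib.parse import urlencode

# Splash endpoint taken from the question; adjust for your own setup.
SPLASH_RENDER = "http://localhost:8050/render.html"


def splash_render_url(target_url, wait=None):
    """Build a render.html URL by hand (hypothetical helper).

    The target URL is percent-encoded so that its own slashes and
    query string cannot be confused with Splash's parameters.
    """
    params = {"url": target_url}
    if wait is not None:
        # `wait` is a render.html parameter: seconds to wait after load.
        params["wait"] = wait
    return f"{SPLASH_RENDER}?{urlencode(params)}"


rendered = splash_render_url(
    "https://www.meetup.com/Google-Cloud_Meetup_Singapore_by_Cloud-Ace"
    "/events/264513425/attendees/",
    wait=5,
)
print(rendered)
```

The resulting URL can be requested with a plain `scrapy.Request`, at the cost of losing what `SplashRequest` adds on top.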
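Answer 0: (score: 0)
It looks like I can get it working with a small Lua script :) It returns a JSON response with the rendered page, which contains everything I need.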
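    from scrapy_splash import SplashRequest

    # Method of the spider class (MeetupGetParticipants).
    def start_requests(self):
        # Lua script run by Splash: load the page, then poll until an
        # attendee element exists before returning the rendered HTML.
        lua_script = """
        function main(splash)
            assert(splash:go(splash.args.url))
            while not splash:select('.attendee-item') do
                splash:wait(0.1)
            end
            return {html=splash:html()}
        end
        """
        yield SplashRequest(
            url=self.url,
            callback=self.parse,
            endpoint='execute',  # run the Lua script instead of render.html
            args={
                'lua_source': lua_script,
                'wait': 5,  # exposed as splash.args.wait; the script does not read it
            },
        )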
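One caveat with the script above: the `while` loop only ends when `.attendee-item` appears, so a page without that element keeps polling until Splash's own timeout aborts the request. A possible variant, sketched under assumptions, caps the polling time. The names `build_splash_args`, `selector` and `max_wait` are hypothetical; they rely only on the fact that extra entries in `args` are visible to the script through `splash.args`.

```python
# Lua script with a bounded polling loop. `selector` and `max_wait` are
# hypothetical argument names passed in through the request's args.
LUA_WAIT_FOR_SELECTOR = """
function main(splash)
    assert(splash:go(splash.args.url))
    local waited = 0
    -- Poll until the element appears or the max_wait budget is spent.
    while not splash:select(splash.args.selector)
          and waited < splash.args.max_wait do
        splash:wait(0.1)
        waited = waited + 0.1
    end
    return {html = splash:html()}
end
"""


def build_splash_args(selector=".attendee-item", max_wait=10.0):
    """Return the args dict for an 'execute' request (hypothetical helper)."""
    return {
        "lua_source": LUA_WAIT_FOR_SELECTOR,
        "selector": selector,   # CSS selector the script waits for
        "max_wait": max_wait,   # upper bound on polling, in seconds
    }


# Intended use inside start_requests (requires scrapy-splash):
#   yield SplashRequest(url=self.url, callback=self.parse,
#                       endpoint='execute', args=build_splash_args())
args = build_splash_args(max_wait=3)
```

If the element never shows up, the script still returns whatever HTML was rendered, so the callback should check that the attendee elements are actually present.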