Question

I have a problem scraping a web page whose content is loaded dynamically.

I start the Splash Docker image with:

docker run -p 8050:8050 scrapinghub/splash --disable-private-mode

My scrapy-splash spider uses a Lua script that should scroll and return the HTML of the whole page:

import scrapy
from scrapy_splash import SplashRequest


class MySplashSpider(scrapy.Spider):
    # requires the scrapy-splash docker image running
    name = "psplash"

    def __init__(self):
        self.domain = 'http://www.phillips.com'
        self.user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:10.0) Gecko/20100101 Firefox/10.0"
        self.script = """
            function main(splash)
                local num_scrolls = 3
                local scroll_delay = 1.0
                splash:set_viewport_full()
                splash:wait(5.0)
                return splash:html()
            end
        """
        self.splash_args = {'lua_source': self.script, 'ua': self.user_agent}

    def start_requests(self):
        base_url = "https://www.phillips.com/auctions/past/filter/Department=20TH%20CENTURY%20%26%20CONTEMPORARY%20ART!Editions!Latin%20America!Photographs"
        yield SplashRequest(base_url,
                            callback=self.parse_pagination,
                            endpoint='execute',
                            args=self.splash_args)

    def parse_pagination(self, response):
        print('xxxxxxxxxx', response.xpath("//footer/ul/li[last()-1]/a/text()").extract())
        print('xxxxxxxxxx', response.xpath("//h2/a/@href").extract())

When I inspect the page with Chrome DevTools, //footer/ul/li[last()-1]/a/text() gives me 29. Why do I get no results for the same expression from response.xpath?

The console output shows no errors:

[
{"response_text": "hello world", "response_xpath_value": []}
]

What am I missing here?

Answer 1

Save the response body to an HTML file, then open it and check whether the full page was actually downloaded, in particular the parts you need. If it was, try your selectors against that saved HTML.
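As a rough sketch of that check, the helper below writes the body to disk and counts start tags, so you can see at a glance whether elements such as footer or h2 made it into what Splash returned. The names TagCounter and dump_and_count and the file name splash_response.html are made up for illustration; only the standard library is used, so this says nothing about XPath matching itself.

```python
from html.parser import HTMLParser
from pathlib import Path


class TagCounter(HTMLParser):
    """Count how often each start tag appears in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def dump_and_count(body: bytes, path: str = "splash_response.html") -> dict:
    """Write the raw response body to disk and return per-tag counts.

    Hypothetical helper: the default file name is an arbitrary choice.
    """
    Path(path).write_bytes(body)
    counter = TagCounter()
    counter.feed(body.decode("utf-8", errors="replace"))
    counter.close()
    return counter.counts


# Possible use inside the spider callback from the question (assumption:
# response.body holds the HTML returned by the Lua script):
#
#     def parse_pagination(self, response):
#         counts = dump_and_count(response.body)
#         self.logger.info("footer=%s h2=%s",
#                          counts.get("footer", 0), counts.get("h2", 0))
```

If the counts for footer or h2 come back as zero, the rendered page never contained those elements and the selectors are not the problem. If they are present, open the saved file in a browser and test the XPath expressions against it.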
