Scrapy: crawling a website with dynamic routing

Time: 2017-01-07 19:48:52

Tags: python web-scraping scrapy

How can I scrape all the tools from a site that uses dynamic routing?

http://growthtools.io/social-media-automation-tools

When I run

scrapy shell 'http://growthtools.io/social-media-automation-tools' 

I get the following output:

2017-01-07 22:43:06 [root] DEBUG: Using default logger
2017-01-07 22:43:06 [root] DEBUG: Using default logger

In [1]: view(response)


The response object does not contain the tools elements.

In [3]: response.css('.toolsList')
Out[3]: []
In [5]: 'toolsList' in response.body
Out[5]: False

Can anyone explain how to parse http://growthtools.io/social-media-automation-tools, and why the response object does not contain the full page content?

1 Answer:

Answer 0: (score: 0)

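The page content is built by JavaScript that a browser executes at load time, and Scrapy is not a browser: it only downloads the raw HTML. You can solve this with scrapy-splash, which provides middleware to use in your Scrapy project. The middleware relies on the Splash JS rendering service, which you can run via Docker.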
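For reference, here is a rough sketch of the project settings that wire the middleware in. The class paths and priority numbers are the ones I remember from the scrapy-splash README, so double-check them against the version you install. The SPLASH_URL value is an assumption: it presumes Splash is running locally on port 8050, e.g. started with `docker run -p 8050:8050 scrapinghub/splash`.

```python
# settings.py (sketch; verify against the scrapy-splash README for your version)

# Assumption: a Splash instance is reachable locally on port 8050.
SPLASH_URL = 'http://localhost:8050'

# Route requests through Splash and keep compression handling after it.
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Avoids storing duplicate Splash arguments with every queued request.
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Makes the duplicate filter take Splash arguments into account.
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```

With settings along these lines, a spider would yield `scrapy_splash.SplashRequest` objects instead of plain `scrapy.Request` ones for the pages that need rendering.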

As for testing it in the Scrapy shell, you can follow this example to run it from the shell.

Works for me:

$ scrapy shell 'http://localhost:8050/render.html?url=http://growthtools.io/social-media-automation-tools' 
In [1]: response.css('.toolsList')
Out[1]: 
[<Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' toolsList ')]" data=u'<div class="col-md-10 col-xs-12 toolsLis'>,
 <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' toolsList ')]" data=u'<div class="col-md-10 col-xs-12 toolsLis'>,
 <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' toolsList ')]" data=u'<div class="col-md-10 col-xs-12 toolsLis'>,
 <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' toolsList ')]" data=u'<div class="col-md-10 col-xs-12 toolsLis'>,
 <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' toolsList ')]" data=u'<div class="col-md-10 col-xs-12 toolsLis'>,
 <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' toolsList ')]" data=u'<div class="col-md-10 col-xs-12 toolsLis'>,
 <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' toolsList ')]" data=u'<div class="col-md-10 col-xs-12 toolsLis'>,
 <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' toolsList ')]" data=u'<div class="col-md-10 col-xs-12 toolsLis'>,
 <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' toolsList ')]" data=u'<div class="col-md-10 col-xs-12 toolsLis'>]
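
The shell command above passes the target URL to Splash's render.html endpoint unescaped, which works here because the target has no query string of its own. If the target URL contained `?` or `&`, it would need to be percent-encoded. A minimal sketch of a helper for that, using only the standard library; the function name `splash_render_url` is hypothetical, and the default base assumes a local Splash on port 8050:

```python
from urllib.parse import urlencode

# Assumption: Splash is running locally on port 8050.
SPLASH_BASE = "http://localhost:8050"


def splash_render_url(target_url, wait=None, base=SPLASH_BASE):
    """Build a render.html URL that asks Splash to render target_url.

    `wait` is the optional number of seconds Splash should wait after
    the page loads, giving its JavaScript time to run.
    """
    params = {"url": target_url}
    if wait is not None:
        params["wait"] = wait
    # urlencode percent-encodes the target so its own ?, & and = do not
    # get mixed up with the parameters meant for Splash.
    return base.rstrip("/") + "/render.html?" + urlencode(params)


if __name__ == "__main__":
    print(splash_render_url(
        "http://growthtools.io/social-media-automation-tools", wait=0.5))
```

The printed URL can be pasted into `scrapy shell '...'` in place of the hand-written one.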