Scraping the hidden part of a website

Time: 2015-06-21 23:33:05

Tags: python selenium web-crawler scrapy

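I am currently trying to collect articles and comments from www.seekingalpha.com, for example the article section and the comments on an article page.

For the article section, I scraped it without trouble using Scrapy, webdriver, and wget (which also downloaded some of the HTML).

The comment section, however, is proving difficult.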

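  1. When I use Scrapy to access page_source directly, the comment section is hidden (it has no content). I suspect the site treats my request as coming from a non-browser client and refuses to show the comments.

  2. I then used ChromeDriver (via webdriver) to visit the site, but only the first page returned some comments; later pages returned none.

  3. I then noticed that logging in with an account avoids this problem, but I could not find a way to log in programmatically, or to do so across 25 proxies.

  4. Am I heading in the wrong direction, and is there a way to sidestep all of these problems?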

1 Answer:

Answer 0: (score: 0)

The page is populated by an AJAX request to the URL below. The following demo in the Scrapy shell shows how to get the data.

scrapy shell 'http://seekingalpha.com/memcached2/hp_top_articles'
2015-06-22 10:43:26+0530 [scrapy] INFO: Scrapy 0.24.6 started (bot: scrapybot)
2015-06-22 10:43:26+0530 [scrapy] INFO: Optional features available: ssl, http11, boto, django
2015-06-22 10:43:26+0530 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2015-06-22 10:43:26+0530 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-06-22 10:43:27+0530 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-06-22 10:43:27+0530 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-06-22 10:43:27+0530 [scrapy] INFO: Enabled item pipelines: 
2015-06-22 10:43:27+0530 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-06-22 10:43:27+0530 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-06-22 10:43:27+0530 [default] INFO: Spider opened
2015-06-22 10:43:27+0530 [default] DEBUG: Crawled (200) <GET http://seekingalpha.com/memcached2/hp_top_articles> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f2896c431d0>
[s]   item       {}
[s]   request    <GET http://seekingalpha.com/memcached2/hp_top_articles>
[s]   response   <200 http://seekingalpha.com/memcached2/hp_top_articles>
[s]   settings   <scrapy.settings.Settings object at 0x7f289ebad450>
[s]   spider     <Spider 'default' at 0x7f2895e356d0>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

In [1]: import json

In [2]: body = response.body

In [3]: cleaned_data = body[body.index('(') + 1:body.rindex(')')]  # keep only the JSON inside SA.Pages.HP.TopArticles.onupdate(...)

In [4]: data = json.loads(cleaned_data)
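Outside the shell, the same unwrapping step can be sketched as a small standalone function. This is only an illustration under assumptions: the name `unwrap_jsonp`, the regular expression, and the shortened sample body are hypothetical and not part of the original answer, and the code is written for Python 3 rather than the Python 2 session above. It matches the `callback(...)` wrapper explicitly rather than relying on `str.strip`, which removes a set of characters rather than a literal prefix.

```python
import json
import re

# Matches "some.callback.name( ... )" with an optional trailing semicolon.
# The greedy (.*) captures everything up to the last closing parenthesis.
JSONP_RE = re.compile(r'^\s*[\w.$]+\s*\((.*)\)\s*;?\s*$', re.DOTALL)


def unwrap_jsonp(text):
    """Return the decoded JSON payload from a JSONP-style response body."""
    match = JSONP_RE.match(text)
    if match is None:
        raise ValueError("body does not look like callback(...)")
    return json.loads(match.group(1))


# Hypothetical sample body, shortened from the fields shown in the answer.
sample = 'SA.Pages.HP.TopArticles.onupdate([{"id": 3273215, "slug": "bbry"}])'
articles = unwrap_jsonp(sample)
print(articles[0]["id"])
```

In a Scrapy callback running on Python 3, the decoded body (`response.text` in recent Scrapy versions) would be passed in place of `sample`.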

Printing data in the shell session above gives the following:

[{u'author_name': None,
  u'author_picture': u'http://static1.cdn-seekingalpha.com/images/users_profile/003/022/051/medium_pic.png?1379847453',
  u'comments_counts': u'13',
  u'company_name': u'BlackBerry Ltd.',
  u'id': 3273215,
  u'path': u'/article/3273215-blackberry-brace-yourself-for-another-ugly-quarter',
  u'publish_on': 1434944662,
  u'slug': u'bbry',
  u'title': u'BlackBerry: Brace Yourself For Another Ugly Quarter'},
 {u'author_name': None,
  u'author_picture': u'http://static.cdn-seekingalpha.com/images/users_profile/000/055/431/medium_pic.png?1379429224',
  u'comments_counts': u'45',
  u'company_name': None,
  u'id': 3272165,
  u'path': u'/article/3272165-weighing-the-week-ahead-what-does-the-greek-crisis-mean-for-financial-markets',
  u'publish_on': 1434863813,
  u'slug': None,
  u'title': u'Weighing The Week Ahead: What Does The Greek Crisis Mean For Financial Markets'},
 {u'author_name': None,
  u'author_picture': u'http://static1.cdn-seekingalpha.com/images/users_profile/003/854/671/medium_pic.png?1428599641',
  u'comments_counts': u'6',
  u'company_name': u'Google Inc.',
  u'id': 3272955,
  u'path': u'/article/3272955-google-a-big-test-lies-ahead-with-the-verticalization-of-youtube',
  u'publish_on': 1434908812,
  u'slug': u'goog',
  u'title': u'Google: A Big Test Lies Ahead With The Verticalization Of YouTube'},
...
...
}]
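Once the list is decoded, each record can be reduced to the fields needed for follow-up requests. The sketch below is one possible way to do that in Python 3, not the answerer's method: the helper name `summarize` is hypothetical, and it assumes that `publish_on` is a Unix timestamp in seconds and that `path` is relative to the host used in the request above.

```python
from datetime import datetime, timezone
from urllib.parse import urljoin

# Host taken from the request shown in the answer.
BASE_URL = "http://seekingalpha.com"


def summarize(record):
    """Pick out a few fields from one record and normalise their types."""
    return {
        "title": record["title"],
        # Build an absolute article URL from the relative path.
        "url": urljoin(BASE_URL, record["path"]),
        # The comment count arrives as a string in the output above.
        "comments": int(record["comments_counts"]),
        # Assumption: publish_on is seconds since the Unix epoch.
        "published": datetime.fromtimestamp(record["publish_on"], tz=timezone.utc),
    }


# First record from the printed output, reduced to the fields used here.
record = {
    "comments_counts": "13",
    "id": 3273215,
    "path": "/article/3273215-blackberry-brace-yourself-for-another-ugly-quarter",
    "publish_on": 1434944662,
    "title": "BlackBerry: Brace Yourself For Another Ugly Quarter",
}

summary = summarize(record)
print(summary["url"])
print(summary["published"].isoformat())
```

The resulting `url` values are the article pages whose comment sections the question is about; whether those comments load from a similar endpoint is not established by the answer.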