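I am currently trying to collect articles and comments from www.seekingalpha.com, for example the article body and the comments here.
For the article body, I scraped everything without trouble using Scrapy, webdriver and wget (I also downloaded some of the HTML).
The comments section, however, is much harder.
When I fetch the page source directly with Scrapy, the comments section is empty (no content). I suspect the site treats my request as coming from a non-browser client and refuses to serve the comments.
I then used ChromeDriver (via webdriver) to open the site, but only the first page returned some comments, and nothing came back after that.
I then noticed that the problem goes away when I am logged in with an account, but I could not find a way to log in programmatically, nor to do so with 25 proxies.
Am I going in the wrong direction, and is there a way to avoid all of these problems?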
Answer 0 (score: 0)
The page is populated by an AJAX request to this URL. The following scrapy shell demo shows how to get the data.
scrapy shell 'http://seekingalpha.com/memcached2/hp_top_articles'
2015-06-22 10:43:26+0530 [scrapy] INFO: Scrapy 0.24.6 started (bot: scrapybot)
2015-06-22 10:43:26+0530 [scrapy] INFO: Optional features available: ssl, http11, boto, django
2015-06-22 10:43:26+0530 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2015-06-22 10:43:26+0530 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-06-22 10:43:27+0530 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-06-22 10:43:27+0530 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-06-22 10:43:27+0530 [scrapy] INFO: Enabled item pipelines:
2015-06-22 10:43:27+0530 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-06-22 10:43:27+0530 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-06-22 10:43:27+0530 [default] INFO: Spider opened
2015-06-22 10:43:27+0530 [default] DEBUG: Crawled (200) <GET http://seekingalpha.com/memcached2/hp_top_articles> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x7f2896c431d0>
[s] item {}
[s] request <GET http://seekingalpha.com/memcached2/hp_top_articles>
[s] response <200 http://seekingalpha.com/memcached2/hp_top_articles>
[s] settings <scrapy.settings.Settings object at 0x7f289ebad450>
[s] spider <Spider 'default' at 0x7f2895e356d0>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
In [1]: import json
In [2]: body = response.body.strip()
In [3]: cleaned_data = body[body.index('(') + 1:body.rindex(')')]  # JSON between the callback's '(' and the final ')'
In [4]: data = json.loads(cleaned_data)
If you print data, you will get the following:
[{u'author_name': None,
u'author_picture': u'http://static1.cdn-seekingalpha.com/images/users_profile/003/022/051/medium_pic.png?1379847453',
u'comments_counts': u'13',
u'company_name': u'BlackBerry Ltd.',
u'id': 3273215,
u'path': u'/article/3273215-blackberry-brace-yourself-for-another-ugly-quarter',
u'publish_on': 1434944662,
u'slug': u'bbry',
u'title': u'BlackBerry: Brace Yourself For Another Ugly Quarter'},
{u'author_name': None,
u'author_picture': u'http://static.cdn-seekingalpha.com/images/users_profile/000/055/431/medium_pic.png?1379429224',
u'comments_counts': u'45',
u'company_name': None,
u'id': 3272165,
u'path': u'/article/3272165-weighing-the-week-ahead-what-does-the-greek-crisis-mean-for-financial-markets',
u'publish_on': 1434863813,
u'slug': None,
u'title': u'Weighing The Week Ahead: What Does The Greek Crisis Mean For Financial Markets'},
{u'author_name': None,
u'author_picture': u'http://static1.cdn-seekingalpha.com/images/users_profile/003/854/671/medium_pic.png?1428599641',
u'comments_counts': u'6',
u'company_name': u'Google Inc.',
u'id': 3272955,
u'path': u'/article/3272955-google-a-big-test-lies-ahead-with-the-verticalization-of-youtube',
u'publish_on': 1434908812,
u'slug': u'goog',
u'title': u'Google: A Big Test Lies Ahead With The Verticalization Of YouTube'},
...
...
}]
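Outside the shell, the same unwrapping step can be put into a small helper. The following is only a sketch, assuming Python 3 and assuming the endpoint keeps returning a JSONP body of the form `SA.Pages.HP.TopArticles.onupdate([...])`. The names `unwrap_jsonp` and `article_urls` are hypothetical, and the sample body is a trimmed, hand-written version of the output above rather than a live response.

```python
import json
import re
from urllib.parse import urljoin

# Matches a JSONP body: a dotted callback name, the payload in parentheses,
# and an optional trailing semicolon. The greedy group runs to the last ')'.
_JSONP_RE = re.compile(r'^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$', re.DOTALL)


def unwrap_jsonp(body):
    """Parse the JSON payload wrapped inside a JSONP callback call."""
    if isinstance(body, bytes):
        # Assumption: the response is UTF-8 encoded.
        body = body.decode('utf-8')
    match = _JSONP_RE.match(body)
    if match is None:
        raise ValueError('body does not look like a JSONP response')
    return json.loads(match.group(1))


def article_urls(articles, base='http://seekingalpha.com'):
    """Build absolute article URLs from the 'path' field of each entry."""
    return [urljoin(base, a['path']) for a in articles if a.get('path')]


if __name__ == '__main__':
    # Hand-written sample that mimics the shape of the response shown above.
    sample = (b'SA.Pages.HP.TopArticles.onupdate(['
              b'{"id": 3273215, "comments_counts": "13", '
              b'"path": "/article/3273215-blackberry-brace-yourself-for-another-ugly-quarter"}'
              b'])')
    for url in article_urls(unwrap_jsonp(sample)):
        print(url)
```

Unlike `str.strip`, which removes any of the given characters from both ends rather than a literal prefix, the regular expression only accepts a body that really has the callback-call shape and raises `ValueError` otherwise. Inside a spider callback, `unwrap_jsonp(response.body)` would be the assumed entry point.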