R Selenium getPageSource() does not return the page source

Asked: 2015-01-15 12:03:43

Tags: r selenium web-scraping

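Morning, people smarter than me. I'm having a strange problem scraping Mashable.com, and I hope someone can shed some light on it.

Mashable's search page populates its results from a block that looks like this: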

    <script>
    window.__bootstrap = {"posts":[{"_id":"54b687d512d2cd49040027dd","id":"2015/01/14/bitcoin-price-200","title":"Bitcoin prices collapse below $200 for first time since 2013","title_tag":null,"author":"Seth Fiegerman","post_date":"2015-01-14T15:14:19+00:00","post_date_rfc":"Wed, 14 Jan 2015 15:14:19 +0000","sort_key":"1ybqcU","link":"http://mashable.com/2015/01/14/bitcoin-price-200/","content":{"plain":"Bitcoin prices are collapsing almost as quickly as they originally skyrocketed.
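
For illustration only: if a block like the one above is present in the raw HTML, the JSON text assigned to `window.__bootstrap` could in principle be pulled out with a regular expression. This is a minimal base R sketch, not something from the original post. The function name `extract_bootstrap_json` is hypothetical, and the pattern assumes the assignment ends just before the closing `</script>` tag, optionally with a semicolon.

```r
# Hypothetical helper (not from the original post): extract the JSON text
# assigned to window.__bootstrap from a page's HTML, using base R only.
extract_bootstrap_json <- function(html) {
  # Collapse a multi-line character vector into a single string.
  html <- paste(html, collapse = "\n")
  # (?s) lets "." match newlines; the lazy group captures everything between
  # "window.__bootstrap =" and an optional ";" before "</script>".
  pattern <- "(?s)window\\.__bootstrap\\s*=\\s*(.*?)\\s*;?\\s*</script>"
  m <- regmatches(html, regexec(pattern, html, perl = TRUE))[[1]]
  # No match: regmatches() yields character(0), so report NA instead.
  if (length(m) < 2) return(NA_character_)
  m[2]
}

# Usage with a made-up page fragment:
sample_html <- '<script>\n  window.__bootstrap = {"posts":[{"title":"x"}]};\n</script>'
json_text <- extract_bootstrap_json(sample_html)
```

The extracted string could then be handed to a JSON parser such as `jsonlite::fromJSON()`, assuming that package is installed.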

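My usual trick for getting around this kind of post-rendering problem is to scrape the page with Selenium, but today things are not going to plan.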

Loading the URL via Selenium:

    uri <- "http://mashable.com/search/?t=stories&q=bitcoin&page=2"
    remoteSelenium$navigate(uri) # send selenium to page
    html <- unlist(remoteSelenium$getPageSource()) # read in page contents

I get...

> html

               applicationCacheEnabled                              rotatable                          handlesAlerts                        databaseEnabled                                version 
                                "TRUE"                                "FALSE"                                 "TRUE"                                 "TRUE"                               "34.0.5" 
                              platform                           nativeEvents                         acceptSslCerts             webdriver.remote.sessionid                      webStorageEnabled 
                                 "MAC"                                "FALSE"                                 "TRUE" "ed06539a-59dc-41a5-ba4e-07b2ed9a9490"                                 "TRUE" 
                locationContextEnabled                            browserName                        takesScreenshot                      javascriptEnabled                    cssSelectorsEnabled 
                                "TRUE"                              "firefox"                                 "TRUE"                                 "TRUE"                                 "TRUE"

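...rather than the page source itself. I can't work out why this happens or how to fix it, since the same approach works fine everywhere else I've tried. Any ideas, or pointers to other questions/answers?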
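
One way a script could at least detect this situation is to check the shape of the returned value before using it. The sketch below is an assumption-laden illustration, not a fix. The function name `looks_like_page_source` is hypothetical, and it only tests whether the value is a single string containing HTML markup, as opposed to a named vector of settings like the one printed above. It does not explain why that vector comes back.

```r
# Hypothetical check (not from the original post): does a value look like
# page source rather than a named vector of session settings?
looks_like_page_source <- function(x) {
  x <- unlist(x)
  # Page source is assumed to arrive as exactly one string containing markup.
  length(x) == 1 &&
    is.character(x) &&
    grepl("<html|<!DOCTYPE", x, ignore.case = TRUE)
}

# Usage with made-up values shaped like the two cases:
caps <- c(browserName = "firefox", platform = "MAC", javascriptEnabled = "TRUE")
page <- list("<!DOCTYPE html><html><body></body></html>")
looks_like_page_source(caps)
looks_like_page_source(page)
```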

0 answers:

No answers