How do I scrape with rvest?

Time: 2017-09-23 22:08:43

Tags: web-scraping rvest

I need to get three different numbers from this page (highlighted in yellow, see image):

https://www.scopus.com/authid/detail.uri?authorId=7006040753

I am using rvest and inspectorgadget with this code:
library(rvest)  # also provides the %>% pipe

site <- read_html("https://www.scopus.com/authid/detail.uri?authorId=7006040753")
hindex <- site %>% html_node(".row3 .valueColumn span") %>% html_text()
documents <- site %>% html_node("#docCntLnk") %>% html_text()
citations <- site %>% html_node("#totalCiteCount") %>% html_text()
print(citations)

I can get the h-index and the documents, but the citations do not work.

Can you help me?

1 answer:

Answer 0: (score: 0)

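I have now found a solution: I noticed that this value takes some time to load, so I included a short pause in the PhantomJS script. It now works on my machine with the following R code: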

library(rvest)  # read_html(), html_nodes(), html_text() and the %>% pipe

setwd("path/to/phantomjs/bin")
system('phantomjs readexample.js') # call PhantomJS script (stored in phantomjs/bin)

totalCiteCount <- "rendered_page.html" %>% # "rendered_page.html" is created by PhantomJS
   read_html() %>%
   html_nodes("#totalCiteCount") %>%
   html_text()

## totalCiteCount
## [1] "52018"

The corresponding PhantomJS script file "readexample.js" looks like this (credit to https://www.r-bloggers.com/web-scraping-javascript-rendered-sites/):

var webPage = require('webpage');
var url ='https://www.scopus.com/authid/detail.uri?authorId=7006040753';
var fs = require('fs'); 
var page = webPage.create();
var system = require('system');

page.settings.userAgent = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';

page.open(url, function (status) {
    setTimeout(function () {
        // wait 2.5 s so the citation count can load before the page is saved
        fs.write('rendered_page.html', page.content, 'w');
        phantom.exit();
    }, 2500);
});

The code throws the following errors in R, but at least the value is scraped correctly.

> system('phantomjs readexample.js')
TypeError: undefined is not a constructor (evaluating 'mutation.addedNodes.forEach')

  https://www.scopus.com/gzip_N1846232499/bundles/SiteCatalystTop.js:73
  :0 in forEach
  https://www.scopus.com/gzip_N1846232499/bundles/SiteCatalystTop.js:73
ReferenceError: Can't find variable: SDM

  https://www.scopus.com/gzip_N1729184664/bundles/AuthorProfileTop.js:73 in sendIndex
  https://www.scopus.com/gzip_N1729184664/bundles/AuthorProfileTop.js:67 in loadEvents

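Using PhantomJS is quite convenient because you do not have to install anything (so it also works if you do not have admin rights on your machine). Just download the .zip file and unpack it to any folder. Then set the working directory in R (setwd()) to the "phantomjs/bin" folder and it works.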
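
The fixed 2500 ms pause in readexample.js is a guess about how long the page needs. As a rough sketch of an alternative, the script could poll until the #totalCiteCount element actually contains text, and give up after a timeout. The helper name waitFor, the 10 s timeout and the 250 ms polling interval are my own assumptions, not part of the original script, and I have not run this against the Scopus page.

```javascript
// Poll `condition` every `intervalMs` until it returns true or `timeoutMs` elapses.
// Calls done(true) on success and done(false) on timeout.
function waitFor(condition, done, timeoutMs, intervalMs) {
    var start = Date.now();
    var timer = setInterval(function () {
        var ok = false;
        try { ok = !!condition(); } catch (e) { ok = false; }
        if (ok || Date.now() - start >= timeoutMs) {
            clearInterval(timer);
            done(ok);
        }
    }, intervalMs);
}

// PhantomJS-only part; the guard lets the helper above load in other JS runtimes.
if (typeof phantom !== 'undefined') {
    var fs = require('fs');
    var page = require('webpage').create();
    var url = 'https://www.scopus.com/authid/detail.uri?authorId=7006040753';

    page.settings.userAgent = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';

    page.open(url, function (status) {
        if (status !== 'success') {
            phantom.exit(1);
            return;
        }
        waitFor(function () {
            // Runs inside the page: true once the citation count has some text.
            return page.evaluate(function () {
                var el = document.querySelector('#totalCiteCount');
                return !!el && el.textContent.replace(/\s/g, '').length > 0;
            });
        }, function (found) {
            // Save whatever was rendered, even on timeout, then quit.
            fs.write('rendered_page.html', page.content, 'w');
            phantom.exit(found ? 0 : 1);
        }, 10000, 250);
    });
}
```

The R side would stay the same, since the script still writes rendered_page.html. The non-zero exit code on timeout is only a convention chosen here so that the caller can notice a page that never finished loading.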

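You can also modify the PhantomJS script from within R (iteratively, if needed), so that a different URL is passed to the script on each pass of a loop. For example: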

for (i in seq_along(urls)) {  # assumes a character vector "urls" was created beforehand

   url_var <- urls[i]
   lines <- readLines("readexample.js")
   lines[2] <- paste0("var url ='", url_var, "';")  # line 2 of the script holds the URL
   writeLines(lines, "readexample.js")  # the new URL is now stored in the PhantomJS script

   system('phantomjs readexample.js')

   # <any code>, e.g. read "rendered_page.html" before the next pass overwrites it #

}
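
Instead of rewriting line 2 of the script on every pass, the URL could probably also be handed over as a command-line argument, since PhantomJS exposes the arguments through system.args (with the script name as the first entry). The sketch below is a variant of readexample.js under that assumption; the helper parseArgs, the optional second argument for the output file, and both default values are hypothetical additions of mine.

```javascript
// Pick the target URL and output file from a PhantomJS-style argument list,
// where args[0] is the script name. Both defaults are illustrative only.
function parseArgs(args) {
    return {
        url: args[1] || 'https://www.scopus.com/authid/detail.uri?authorId=7006040753',
        outFile: args[2] || 'rendered_page.html'
    };
}

// PhantomJS-only part; the guard lets parseArgs load in other JS runtimes.
if (typeof phantom !== 'undefined') {
    var fs = require('fs');
    var system = require('system');
    var page = require('webpage').create();
    var opts = parseArgs(system.args);

    page.settings.userAgent = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';

    page.open(opts.url, function (status) {
        setTimeout(function () {
            // same 2.5 s pause as in the original script
            fs.write(opts.outFile, page.content, 'w');
            phantom.exit();
        }, 2500);
    });
}
```

From R, the loop body could then call something like system2("phantomjs", c("readexample.js", shQuote(url_var))) and leave the script file untouched. Quoting the URL matters because characters such as ? and & have a special meaning in most shells.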

I hope this gets you a step further.