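R web scraping: RCurl and httr content

Time: 2018-06-01 20:20:18

Tags: html web-scraping rvest rcurl httr

I'm learning about web scraping and have a question about two packages, httr and RCurl. I'm trying to get a journal's code (ISSN) from the ResearchGate website and ran into a problem. When I extract the content from the site with both packages, I get the ISSN with RCurl, but with httr my code returns NULL. Can someone tell me why? As far as I can tell, both approaches should work. The code is below.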

library(rvest)
library(httr)
library(RCurl)
library(stringr)  # provides str_to_title() and str_split() used below

url <- "https://www.researchgate.net/journal/0730-0301_Acm_Transactions_On_Graphics"

########
# httr #
########

conexao <- GET(url)
conexao_status <- http_status(conexao)
conexao_status

content(conexao, as = "text", encoding = "utf-8") %>% read_html() -> webpage1

ISSN <- webpage1 %>%
  html_nodes(xpath = '//*/div/div[2]/div[1]/div[1]/table[2]/tbody/tr[7]/td') %>%
  html_text %>%
  str_to_title() %>%
  str_split(" ") %>%
  unlist
ISSN

########
# RCurl #
########

options(RCurlOptions = list(verbose = FALSE, 
                            capath = system.file("CurlSSL", "cacert.pem", package = "RCurl"), 
                            ssl.verifypeer = FALSE))

webpage <- getURLContent(url) %>% read_html()

ISSN <- webpage %>%
  html_nodes(xpath = '//*/div/div[2]/div[1]/div[1]/table[2]/tbody/tr[7]/td') %>%
  html_text %>%
  str_to_title() %>%
  str_split(" ") %>%
  unlist
ISSN
  
    

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=Portuguese_Brazil.1252  LC_CTYPE=Portuguese_Brazil.1252  LC_MONETARY=Portuguese_Brazil.1252
[4] LC_NUMERIC=C                       LC_TIME=Portuguese_Brazil.1252

attached base packages:
[1] stats  graphics  grDevices  utils  datasets  methods  base

other attached packages:
 [1] testit_0.7  dplyr_0.7.4  progress_1.1.2  readxl_1.1.0  stringr_1.3.0  RCurl_1.95-4.10  bitops_1.0-6
 [8] httr_1.3.1  rvest_0.3.2  xml2_1.2.0  jsonlite_1.5

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.16  bindr_0.1.1  magrittr_1.5  R6_2.2.2  rlang_0.2.0  tools_3.5.0
 [7] yaml_2.1.19  assertthat_0.2.0  tibble_1.4.2  bindrcpp_0.2.2  curl_3.2  glue_1.2.0
[13] stringi_1.1.7  pillar_1.2.2  compiler_3.5.0  cellranger_1.1.0  prettyunits_1.0.2  pkgconfig_2.0.1

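1 answer:

Answer 0 (score: 1)

The response that httr receives has a content type of JSON, not HTML, so read_html() cannot parse it into the page structure your XPath expects: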

> conexao
Response [https://www.researchgate.net/journal/0730-0301_Acm_Transactions_On_Graphics]
Date: 2018-06-02 03:15
Status: 200
Content-Type: application/json; charset=utf-8
Size: 328 kB
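
The Content-Type line in the printout above is what signals the problem. As a minimal sketch, the same check can be done in code with httr's parse_media(), here applied to the header value copied from that printout so that it runs without a network call. The variable names content_type, media and is_html are illustrative only.

```r
library(httr)

# Header value copied from the response printed above.
content_type <- "application/json; charset=utf-8"

# parse_media() splits a Content-Type value into its media type and parameters.
media <- parse_media(content_type)
media$complete   # the media type without the charset parameter

# Only treat the body as HTML when the server says it is HTML.
is_html <- media$complete %in% c("text/html", "application/xhtml+xml")
is_html

# With a live response object, http_type(conexao) should give the same
# media type string directly.
```
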

Use fromJSON() from jsonlite instead of read_html() to extract the ISSN:

library(jsonlite)
result <- fromJSON(content(conexao, as = "text", encoding = "utf-8") )
result$result$data$journalFullInfo$data$issn

Result:

> result$result$data$journalFullInfo$data$issn
[1] "0730-0301"
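
To avoid hitting the same issue with other URLs, one option is to choose the parser from the media type instead of always calling read_html(). The sketch below is only an illustration: parse_body() is a hypothetical helper (not part of httr, rvest, or jsonlite), and the sample JSON is a made-up stand-in with a simplified structure, not the real ResearchGate payload.

```r
library(jsonlite)
library(xml2)

# Hypothetical helper: pick a parser based on the media type of the body.
parse_body <- function(body_text, media_type) {
  if (identical(media_type, "application/json")) {
    fromJSON(body_text)
  } else if (media_type %in% c("text/html", "application/xhtml+xml")) {
    read_html(body_text)
  } else {
    stop("Unsupported media type: ", media_type)
  }
}

# Made-up stand-in body so the example runs offline.
sample_json <- '{"journal": {"issn": "0730-0301"}}'
parsed <- parse_body(sample_json, "application/json")
parsed$journal$issn

# With the live response from the question, the call might look like:
# parsed <- parse_body(content(conexao, as = "text", encoding = "UTF-8"),
#                      httr::http_type(conexao))
```

As for why the two packages see different content: by default httr sends an Accept header that lists application/json first, which may be why this server answered with JSON. If the HTML page is what you want, passing accept("text/html") to GET() is one thing to try, although whether this server honors it has not been verified here.
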