Scraping a JavaScript object in R / rvest and converting it to JSON

Time: 2017-10-19 05:13:20

Tags: json r rvest

I am scraping the following website: https://www.banorte.com/wps/portal/ixe/Home/indicadores/tipo-de-cambio

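I'm trying to get the currency exchange-rate table into an R data frame with the rvest package, but the table itself is built from a JavaScript variable inside the HTML code.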

I found the relevant CSS selector, and now I have this:

library(rvest)    
banorte <- "https://www.banorte.com/wps/portal/ixe/Home/indicadores/tipo-de-cambio/" %>%
      read_html() %>%
      html_nodes('#indicadores_financieros_wrapper > script:nth-child(2)')

My output is now the following JavaScript script, as an XML nodeset:

<script>
$(document).ready(function(){
    var valor = '{"tablaDivisas":[{"nombreDivisas":"FRANCO SUIZO","compra":"18.60","venta":"19.45"}, {"nombreDivisas":"LIBRA ESTERLINA","compra":"24.20","venta":"25.15"}, {"nombreDivisas":"YEN JAPONES","compra":"0.1635","venta":"0.171"}, {"nombreDivisas":"CORONA SUECA","compra":"2.15","venta":"2.45"}, {"nombreDivisas":"DOLAR CANADA","compra":"14.50","venta":"15.35"}, {"nombreDivisas":"EURO","compra":"21.75","venta":"22.60"}], "tablaDolar":[{"nombreDolar":"VENTANILLA","compra":"17.73","venta":"19.15"}]}';
    if(valor != '{}'){
        var objJSON = eval("(" + valor + ")");
        var tabla="<tbody>";
        for ( var i = 0; i < objJSON["tablaDolar"].length; i++) {
            tabla+= "<tr>";
            tabla+= "<td>" + objJSON["tablaDolar"][i].nombreDolar + "</td>";
            tabla+= "<td>$" + objJSON["tablaDolar"][i].compra + "</td>";
            tabla+= "<td>$" + objJSON["tablaDolar"][i].venta + "</td>";
            tabla+= "</tr>";
        }
        tabla+= "</tbody>";
        $("#tablaDolar").append(tabla);
        var tabla2="";
        for ( var i = 0; i < objJSON["tablaDivisas"].length; i++) {
            tabla2+= "<tr>";
            tabla2+= "<td>" + objJSON["tablaDivisas"][i].nombreDivisas + "</td>";
            tabla2+= "<td>$" + objJSON["tablaDivisas"][i].compra + "</td>";
            tabla2+= "<td>$" + objJSON["tablaDivisas"][i].venta + "</td>";
            tabla2+= "</tr>";
        }
        tabla2+= "</tbody>";
        $("#tablaDivisas").append(tabla2);
    }
    bmnIndicadoresResponsivoInstance.cloneResponsive(0);
});
</script>

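My question is how to strip out almost everything (all the JavaScript functions/operators) so that only this data remains, so I can eventually convert it to a JSON table like this: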

{"tablaDivisas":[{"nombreDivisas":"FRANCO SUIZO","compra":"18.60","venta":"19.45"},
{"nombreDivisas":"LIBRA ESTERLINA","compra":"24.20","venta":"25.15"},
{"nombreDivisas":"YEN JAPONES","compra":"0.1635","venta":"0.171"}, 
{"nombreDivisas":"CORONA SUECA","compra":"2.15","venta":"2.45"}, 
{"nombreDivisas":"DOLAR CANADA","compra":"14.50","venta":"15.35"}, 
{"nombreDivisas":"EURO","compra":"21.75","venta":"22.60"}],
"tablaDolar":[{"nombreDolar":"VENTANILLA","compra":"17.73","venta":"19.15"}]}

In other words, I need to extract the "valor" variable from the JS script using R.

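For some reason I'm having trouble doing all of this within R (that is, without exporting the variable to an external .txt file and then using substring).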

2 answers:

Answer 0 (score: 1)

This is definitely a heavier-weight answer, but it generalizes to other, gnarlier "JavaScript problems".

library(rvest)
library(stringi)
library(V8)
library(tidyverse)

banorte <- "https://www.banorte.com/wps/portal/ixe/Home/indicadores/tipo-de-cambio/" %>%
      read_html() %>%
      html_nodes('#indicadores_financieros_wrapper > script:nth-child(2)')

We'll set up a JavaScript V8 context:

ctx <- v8()

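Then:

  • get the <script> contents
  • split them into lines
  • turn the result into a plain character vector
  • remove the cruft (keep only the line that starts with a single tab followed by var)
  • evaluate the JavaScript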

It's not too bad:

html_text(banorte) %>% 
  stri_split_lines() %>% 
  flatten_chr() %>% 
  keep(stri_detect_regex, "^\tvar") %>% 
  ctx$eval()
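To see what the filtering step keeps, here is a minimal offline sketch. It assumes the page's script is tab-indented (which is what the "^\tvar" pattern implies), and `script_text` is a hypothetical, shortened stand-in for `html_text(banorte)`, not the real page content. It uses `stri_split_lines1()` and plain subsetting instead of the purrr helpers, purely to keep the example small.

```r
library(stringi)

# Hypothetical, shortened stand-in for html_text(banorte); tab-indented by assumption
script_text <- paste(
  "$(document).ready(function(){",
  "\tvar valor = '{\"tablaDolar\":[{\"nombreDolar\":\"VENTANILLA\",\"compra\":\"17.73\",\"venta\":\"19.15\"}]}';",
  "\tif(valor != '{}'){",
  "\t\tvar objJSON = eval(\"(\" + valor + \")\");",
  "\t}",
  "});",
  sep = "\n"
)

# Split the single string into a character vector of lines
script_lines <- stri_split_lines1(script_text)

# Keep only lines that begin with exactly one tab followed by "var";
# the doubly indented "var objJSON" line does not match
kept <- script_lines[stri_detect_regex(script_lines, "^\tvar")]
```

Only the `var valor = ...` assignment survives, which is why it can be handed to the V8 context on its own without jQuery or the rest of the page being defined.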

Since the JavaScript value is a JSON string, we parse it in R rather than evaluating it in V8:

jsonlite::fromJSON(ctx$get("valor"))
## $tablaDivisas
##     nombreDivisas compra venta
## 1    FRANCO SUIZO  18.60 19.45
## 2 LIBRA ESTERLINA  24.20 25.15
## 3     YEN JAPONES 0.1635 0.171
## 4    CORONA SUECA   2.15  2.45
## 5    DOLAR CANADA  14.50 15.35
## 6            EURO  21.75 22.60
## 
## $tablaDolar
##   nombreDolar compra venta
## 1  VENTANILLA  17.73 19.15

This approach generalizes better when there is other useful processing in the JavaScript.
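As a rough illustration of that point, the sketch below lets V8 do the parsing and then pulls the finished object into R. The `valor` string is a hypothetical one-row excerpt rather than the live page data, and `dolar` is a name made up for this example.

```r
library(V8)

ctx <- v8()

# Hypothetical one-row excerpt standing in for the "var valor = ..." line from the page
ctx$eval("var valor = '{\"tablaDolar\":[{\"nombreDolar\":\"VENTANILLA\",\"compra\":\"17.73\",\"venta\":\"19.15\"}]}';")

# Do the parsing (and any further processing) on the JavaScript side
ctx$eval("var dolar = JSON.parse(valor).tablaDolar;")

# Retrieve the result; V8 converts the JavaScript array of objects to an R data frame
dolar <- ctx$get("dolar")
```

Anything else the page's script computes from `valor` could be evaluated in the same context and fetched with `ctx$get()` in the same way.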

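NOTE: Google Translate in my Chrome beta channel isn't translating the site very well, but I think you are very close to violating the spirit of item 6 on the "Términos Legales" page; until I can translate it, I can't say for certain. When/if I can, and it looks like you are, I will delete this answer.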

Answer 1 (score: 0)

You can do it like this:

library(rvest)    
banorte <- "https://www.banorte.com/wps/portal/ixe/Home/indicadores/tipo-de-cambio/" %>%
    read_html() %>%
    html_nodes('#indicadores_financieros_wrapper > script:nth-child(2)') %>%
    as_list()

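# Pull the script text out of the nested list and split it into lines
banorte_vec <- strsplit(banorte[[c(1,1)]],"\r\n")[[1]]
# Keep only the line that assigns `valor`
valor <- grep("valor = ", banorte_vec, value = TRUE)
# Strip the JavaScript around the JSON: the declaration, the trailing '; and the opening quote
valor <- gsub("\tvar valor = ","",valor)
valor <- gsub("';$","",valor)
valor <- gsub("^'","",valor)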

library(jsonlite)
result <- fromJSON(valor)
result

$tablaDivisas
  nombreDivisas compra venta
1    FRANCO SUIZO  18.60 19.45
2 LIBRA ESTERLINA  24.20 25.15
3     YEN JAPONES 0.1635 0.171
4    CORONA SUECA   2.15  2.45
5    DOLAR CANADA  14.50 15.35
6            EURO  21.75 22.60

$tablaDolar
  nombreDolar compra venta
1  VENTANILLA  17.73 19.15
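The grep/gsub chain can also be collapsed into a single regular-expression match. The following is only a sketch under stated assumptions: `script_text` is a hypothetical, shortened stand-in for the script text, and the pattern assumes the JSON sits on one line and is closed by `';` as in the script shown in the question. Because `compra` and `venta` come back as character columns, the last step converts them to numeric.

```r
library(jsonlite)

# Hypothetical, shortened stand-in for the <script> text
script_text <- paste(
  "$(document).ready(function(){",
  "    var valor = '{\"tablaDolar\":[{\"nombreDolar\":\"VENTANILLA\",\"compra\":\"17.73\",\"venta\":\"19.15\"}]}';",
  "    if(valor != '{}'){",
  "    }",
  "});",
  sep = "\n"
)

# With perl = TRUE, "." does not cross newlines, so this matches within one line:
# from a "{" through the last "}" that is immediately followed by "';"
valor <- regmatches(script_text,
                    regexpr("\\{.*\\}(?=';)", script_text, perl = TRUE))

result <- fromJSON(valor)

# The rates are parsed as strings; convert them to numeric for calculations
num_cols <- c("compra", "venta")
result$tablaDolar[num_cols] <- lapply(result$tablaDolar[num_cols], as.numeric)
```

The same conversion would apply to `result$tablaDivisas` on the full data. If the site changes how `valor` is quoted or wrapped, the pattern would need adjusting.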