Extracting JavaScript variable declarations from HTML in R

Date: 2016-12-10 16:23:17

Tags: r web-scraping rvest

Is there a way to use rvest (or any other package) to extract variable declarations from a website, for example:

var global_tmp_status   =   0;

var global_goal_scored_overtime = [
      ['x', 'Headed', 'Left foot', 'Right foot', 'Other', 'Overall'],
      ['14/8/2016', 1,  0,  2,  0,  3]]; </script>

I would like to extract the data in global_goal_scored_overtime as a table.

Thanks.

1 answer:

Answer 0 (score: 3)

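You can evaluate the script with the excellent V8 package, which runs the JavaScript and lets you read the variables back into R: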

library(rvest)  # HTML parsing; also provides the %>% pipe
library(V8)     # embedded JavaScript engine
txt <- "<!DOCTYPE html>
<html>
<body>

<script>
var global_tmp_status = 0;
var global_goal_scored_overtime = [ ['x', 'Headed', 'Left foot', 'Right foot', 'Other', 'Overall'], ['14/8/2016', 1, 0, 2, 0, 3]];
</script> 

</body>
</html>"
# html_node() returns only the first <script>; on a real page you will
# probably need a more specific selector to find the right one.
script <- read_html(txt) %>% html_node("script") %>% html_text(trim=TRUE)
ctx <- v8()
ctx$eval(script)
ctx$get("global_tmp_status")
ctx$get("global_goal_scored_overtime")
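On a real page there is usually more than one <script> element, so taking the first match is rarely enough. One possible approach, shown here as a minimal sketch, is to collect the text of every script and keep the one that mentions the variable name. The HTML string below is a hypothetical stand-in for a downloaded page, and the approach assumes the declaration sits in an inline script (a script loaded through a src attribute has no text to extract).

```r
library(rvest)

# Hypothetical page with two inline scripts; only the second declares the variable.
page <- read_html("<html><head><script>var unrelated = 1;</script></head>
<body><script>var global_goal_scored_overtime = [['x', 'Overall'], ['14/8/2016', 3]];</script></body></html>")

# Text of every <script> element, one string per element.
scripts <- page %>% html_nodes("script") %>% html_text(trim = TRUE)

# Keep only the script(s) whose source mentions the variable name.
script <- scripts[grepl("global_goal_scored_overtime", scripts, fixed = TRUE)]
```

The resulting script string can then be passed to ctx$eval() exactly as in the answer's code.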

Evaluating the two ctx$get() calls gives:

> ctx$get("global_tmp_status")
[1] 0

> ctx$get("global_goal_scored_overtime")
     [,1]        [,2]     [,3]        [,4]         [,5]    [,6]     
[1,] "x"         "Headed" "Left foot" "Right foot" "Other" "Overall"
[2,] "14/8/2016" "1"      "0"         "2"          "0"     "3"
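The result is a character matrix whose first row holds the column headers, so one more step is needed to get the table the question asks for. The following base R sketch starts from a matrix typed in by hand in the same shape as the output above, so it runs without V8. Renaming the first column to date and reading it as day/month/year are assumptions based on the single sample value.

```r
# Character matrix in the shape returned by ctx$get("global_goal_scored_overtime").
m <- rbind(
  c("x", "Headed", "Left foot", "Right foot", "Other", "Overall"),
  c("14/8/2016", "1", "0", "2", "0", "3")
)

# Use the first row as column names and the remaining rows as data.
# drop = FALSE keeps a matrix even when only one data row is left.
goals <- as.data.frame(m[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(goals) <- m[1, ]

# Convert the count columns from character to integer.
goals[-1] <- lapply(goals[-1], as.integer)

# Rename the first column and parse it as a date (day/month/year is assumed).
names(goals)[1] <- "date"
goals$date <- as.Date(goals$date, format = "%d/%m/%Y")

str(goals)
```

With real data, replace the hand-written matrix with the value returned by ctx$get("global_goal_scored_overtime").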