Question

I want to scrape movie titles from the following page: https://www.rottentomatoes.com/browse/in-theaters/, but the list that comes back is empty.

Here is my code:

library(rvest)
html <- read_html("https://www.rottentomatoes.com/browse/in-theaters/")
movies <- html_nodes(html, ".movieTitle")

.movieTitle is the HTML class.

Answer 1

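I tried using the V8 package:

ct <- v8()
movies <- page %>% html_nodes('.movieTitle') %>% html_nodes('script') %>% html_text()

But the movies variable contains an empty character vector. Is my code wrong, or does this suggest that the movie titles are not rendered with JS?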

Answer 2

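Yiqin, your v8 call does not run any JS code. V8 lets you run JavaScript functions from R and returns the results as JSON, which you then need to parse back into R. See here: https://cran.r-project.org/web/packages/V8/vignettes/v8_intro.html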
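
As a rough illustration of what "running JS code" in a V8 context looks like, here is a minimal sketch. The titles array is made-up sample data, not anything taken from the Rotten Tomatoes page, and the snippet assumes the V8 package is installed. It only shows that creating a context with v8() does nothing until eval() is called, and that get() brings a JavaScript value back into R.

```r
library(V8)

# Create a JavaScript context. Nothing is executed at this point.
ct <- v8()

# Run some JavaScript inside the context.
# The array is hypothetical sample data used only for illustration.
ct$eval('var titles = ["Movie A", "Movie B"].map(function (t) { return t.toUpperCase(); });')

# get() serialises the JS value to JSON and parses it back into an R object.
titles <- ct$get("titles")
print(titles)
```

Note that a V8 context has no DOM and does not load web pages, so this on its own will not render a page that builds its content in the browser.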

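That said, you can check for yourself whether the HTML that rvest captured actually contains the content you want.

You can list all the DOM nodes in the captured document ("html"):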

library(rvest)
library(tidyverse)  # provides the %>% pipe

html <- read_html("https://www.rottentomatoes.com/browse/in-theaters/")
movies <- html_nodes(html, ".movieTitle")

# Print the tag structure of everything rvest captured
html %>% xml2::html_structure()
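
The same check can be done programmatically. The sketch below runs on a small inline HTML string that stands in for a page whose titles are delivered inside a script tag rather than as .movieTitle elements. The sample markup and the helper name count_matches are assumptions made for illustration, not the real structure of the Rotten Tomatoes page.

```r
library(rvest)

# Hypothetical stand-in for a page that ships its data inside a <script> tag.
sample_html <- '
<html><body>
  <div id="content"></div>
  <script>var data = {"titles": ["Example Movie"]};</script>
</body></html>'

doc <- read_html(sample_html)

# Helper (hypothetical name): how many nodes match a CSS selector?
count_matches <- function(doc, css) length(html_nodes(doc, css))

n_titles  <- count_matches(doc, ".movieTitle")  # nodes with the class we want
n_scripts <- count_matches(doc, "script")       # script tags in the source

# Does the text we are after appear anywhere in the raw source?
in_source <- grepl("Example Movie", as.character(doc), fixed = TRUE)

print(c(titles = n_titles, scripts = n_scripts))
print(in_source)
```

If the selector matches nothing but the text is present in the raw source, the data is probably embedded in a script or loaded separately, and a CSS selector on the static HTML will keep returning an empty result.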

You can also write the captured HTML to a local file and then open that file in your browser:

xml2::write_html(html, "name_of_file.html")
browseURL("name_of_file.html")

Web scraping in R returns an empty result
