Question

I'm trying to read the HTML of a web page so I can scrape some data from it, but I'm getting a strange error.

Here is an example link: www.boxofficemojo.com/movies/?id=avatar.htm

Here is the code:

library(RCurl)
library(XML)
library(rvest)

url <- paste0("www.boxofficemojo.com",movies_table[1,1])

webpage <- read_html(url)

gross_data_html <- html_nodes(webpage,".mp_box_content b")

Result:

library(RCurl)
library(XML)
library(rvest)

url <- paste0("www.boxofficemojo.com",movies_table[1,1])

webpage <- read_html(url)
> Error: 'www.boxofficemojo.com/movies/?id=avatar.htm' does not exist in current working directory ('C:/Users/Will/Documents').

gross_data_html <- html_nodes(webpage,".mp_box_content b")
> Error in html_nodes(webpage, ".mp_box_content b") : object 'webpage' not found

Why is this happening? Does it have something to do with the file type being .htm rather than .html?

Answer 1

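If you want to pass a URL to read_html, it needs to start with "http://". Otherwise the function assumes the input is a local file path, and no file with that name exists in your working directory, which is exactly what the error message says. The .htm extension has nothing to do with it.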

Wrong:

read_html('www.boxofficemojo.com/movies/?id=avatar.htm')

Right:

read_html('http://www.boxofficemojo.com/movies/?id=avatar.htm')
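Since the question builds its URL with paste0 from a table of paths, one way to avoid this is to add the scheme whenever it is missing. The following is only a minimal sketch: ensure_scheme is a hypothetical helper name, not part of rvest, and the choice of "http://" as the default scheme is an assumption taken from the example above.

```r
# Hypothetical helper (not part of rvest): prepend a scheme such as
# "http://" when the string does not already start with one.
ensure_scheme <- function(url, scheme = "http://") {
  # TRUE for strings that already begin with something like "http://" or "https://"
  has_scheme <- grepl("^[A-Za-z][A-Za-z0-9+.-]*://", url)
  ifelse(has_scheme, url, paste0(scheme, url))
}

# The path from the question, with no scheme in front of it
url <- ensure_scheme("www.boxofficemojo.com/movies/?id=avatar.htm")

# The result can then be passed to rvest as before:
# webpage <- read_html(url)
```

Because grepl, ifelse and paste0 are all vectorized, the same helper should also work on a whole column of paths rather than a single string.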
