Scraping a web page with the rvest package and SelectorGadget in R

Time: 2020-01-06 09:22:24

Tags: r web-scraping css-selectors rvest

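I am trying to scrape Baltic Panamax Index data from a site. I have scraped data from other sites with the same approach, but it does not work for this page. I am on an office connection, and the site I want to download from is flagged as a "not secure" connection. Could that be causing the problem?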

I need the "Date" and "Close" columns, and so far I have written the following scraping code:

# Baltic Panamax Index
library(rvest)

# Specify the URL of the page to be scraped
con <- url("http://marine-transportation.capitallink.com/indices/baltic_exchange_history.html?ticker=BPI", "rb")

# Read the HTML code from the website
webpage <- read_html(con)
webpage

# Use a CSS selector to pick out the date column
date_data <- html_nodes(webpage, ".text .div_line:nth-child(2)")

# Convert the date nodes to text
date_data <- html_text(date_data)

# Look at the first few dates
head(date_data)

Desired output:

Date          Close
Jan 03,2020   949
Jan 02,2020   1003

1 Answer:

Answer 0: (score: 0)

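To get this page, you need to send your user name as a cookie in the request headers. I find that the httr package gives a lot of flexibility for making this kind of request. For this site, you will need a user name that is already registered with the site. Just change the user_name field in the code below and it should work: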

# Use the httr package to allow flexibility with http requests
library(httr)
library(rvest)

# Set username here -----
#                       |
#             ---------------------
#             |                   |
#             v                   v
user_name  <- "my.name@example.com"

# Set url we need
site  <- "http://marine-transportation.capitallink.com"
url   <- paste0(site, "/indices/baltic_exchange_history.html?ticker=BPI")

# Obtain the page we want using user name as a cookie
response <- GET(url, set_cookies(clUser_email = user_name,
                                 expires      = "Sat, 16-Sep-2051 11:30:30 GMT",
                                 `Max-Age`    = "1000000000",
                                 path         = "/",
                                 domain       = "capitallink.com"))

# Parse the HTML code from the website using rvest
webpage       <- read_html(response)
date_data     <- html_nodes(webpage, "table")
result        <- html_table(date_data[4])[[1]]

# Tidy up the result
result        <- result[-1, 2:3]
names(result) <- c("Date", "Close")

Now we get the result you wanted:

result
#>            Date   Close
#> 2  Jan 06, 2020  890.00
#> 3  Jan 03, 2020  949.00
#> 4  Jan 02, 2020 1003.00
#> 5  Dec 24, 2019 1117.00
#> 6  Dec 23, 2019 1154.00
#> 7  Dec 20, 2019 1201.00
#> 8  Dec 19, 2019 1265.00
#> 9  Dec 18, 2019 1340.00
# ....[ plus 50 more rows]....
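
If you want to plot or sort the data, the two columns probably need converting first, since they appear to come back from html_table as character strings. The following is only a sketch under that assumption: it rebuilds three rows from the output above as a stand-in for the real `result`, and it assumes an English locale so that `%b` matches month abbreviations such as "Jan".

```r
# Stand-in for the scraped table: three rows copied from the output above.
# In practice you would apply the two conversions directly to `result`.
result <- data.frame(
  Date  = c("Jan 06, 2020", "Jan 03, 2020", "Jan 02, 2020"),
  Close = c("890.00", "949.00", "1003.00"),
  stringsAsFactors = FALSE
)

# Parse strings such as "Jan 06, 2020" into Date objects.
# %b is locale-dependent, so this assumes English month abbreviations.
result$Date <- as.Date(result$Date, format = "%b %d, %Y")

# Remove any thousands separators (an assumption, none appear in the
# output above) and convert the closing values to numeric.
result$Close <- as.numeric(gsub(",", "", result$Close))

# Inspect the column types after conversion
str(result)
```

If `as.Date` returns NA for every row, the locale is the most likely cause; checking `Sys.getlocale("LC_TIME")` is a reasonable first step.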