Scraping data with rvest when the page has a "Load More" option at the bottom

Asked: 2016-05-31 15:00:51

Tags: r web-scraping screen-scraping text-mining rvest

I am learning web scraping and trying to scrape information from https://www.kununu.com/us/google1/reviews.

Here is my code:

rm(list = ls())

library(httr)
library(rvest)
library(xml2)
library(curl)

url <- "https://www.kununu.com/us/google1/reviews"

reviews <- url %>%
    read_html() %>%
    html_nodes(".panel-body")

quote <- reviews %>%
    html_nodes("h2 a") %>%
    html_text()

rating <- reviews %>%
    html_nodes(".tile-heading") %>%
    html_text()

date <- reviews %>%
    html_nodes("strong") %>%
    html_text()

a <- data.frame(quote, rating, date, stringsAsFactors = FALSE)

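However, the code above only scrapes the first ten reviews. I found suggestions online to use the RSelenium package for dynamic websites. Unfortunately, when I call checkForServer() and then startServer(), my computer throws an error. Is there a way to collect all 56 reviews in one go when there is a LOAD MORE option at the bottom?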

1 answer:

Answer 0 (score: 0)

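If you hover over the Load More link, you will see that it simply appends an integer to the end of your URL. So you can get everything by looping over the pages. First extract the number of reviews, then work out how many pages that corresponds to, then reuse your code to fetch the data from each page.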
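Before requesting anything from the site, the page arithmetic described above can be checked offline. The following is only a minimal sketch: the count of 56 reviews comes from the question, while `per_page <- 10` is an assumption based on the question's observation that only ten reviews load at a time, and the names `n_reviews`, `per_page`, `n_pages`, and `page_urls` are illustrative.

```r
base_url <- "https://www.kununu.com/us/google1/reviews"

n_reviews <- 56   # total reviews, as stated in the question
per_page  <- 10   # assumption: the site serves 10 reviews per page

# Number of pages needed to cover every review (the last page may be partial)
n_pages <- ceiling(n_reviews / per_page)

# "Load More" appears to append the page index to the URL, e.g. .../reviews/2
page_urls <- paste0(base_url, "/", seq_len(n_pages))

print(page_urls)
```

With these inputs the sketch produces six URLs, ending in `/1` through `/6`. Using `ceiling()` on the division is an alternative to rounding the review count up to the next multiple of ten first; both give the same page count.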

library(httr)
library(rvest)
library(xml2)
library(curl)
library(plyr)

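url <- "https://www.kununu.com/us/google1/reviews"

# Read the total number of reviews from the first ".title-number" element
num_of_reviews <- read_html(url) %>%
  html_nodes(".title-number") %>%
  .[[1]] %>%
  html_text()

# Round up to the next multiple of 10 (each page holds 10 reviews)
num_of_reviews_rounded <- num_of_reviews %>%
  as.numeric() %>%
  round_any(10, f = ceiling)

# One page index per block of 10 reviews
pages <- 1:(num_of_reviews_rounded / 10)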

# Scrape the quote, rating, and date of every review on a single page
get_reviews <- function(url) {
  reviews <- url %>%
    read_html() %>%
    html_nodes(".panel-body")

  quote <- reviews %>%
    html_nodes("h2 a") %>%
    html_text()

  rating <- reviews %>%
    html_nodes(".tile-heading") %>%
    html_text()

  date <- reviews %>%
    html_nodes("strong") %>%
    html_text()

  a <- data.frame(quote, rating, date, stringsAsFactors = FALSE)
  return(a)
}

# Scrape each page (url/1, url/2, ...) and stack the results into one data frame
list_of_dfs <- lapply(pages, function(x) get_reviews(paste0(url, "/", x)))
df <- do.call(rbind, list_of_dfs)

> str(df)
'data.frame':   56 obs. of  3 variables:
 $ quote : chr  "Exceptional: 4.13 of 5" "Noteworthy: 3.75 of 5" "Remarkable: 5.00 of 5" "Exemplary: 4.25 of 5" ...
 $ rating: chr  "\n        4.13\n    " "\n        3.75\n    " "\n        5.00\n    " "\n        4.25\n    " ...
 $ date  : chr  "Dec 30, 2015" "Dec 30, 2015" "Dec 30, 2015" "Dec 29, 2015" ...
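The `str(df)` output above shows that `rating` still carries surrounding whitespace and that `date` is plain text. One possible cleanup step, offered as a sketch rather than as part of the original answer, is shown below. The small `sample_df` is hand-built from the four rows visible in that output so the snippet runs without network access; `clean_reviews` is a hypothetical helper name, and the `"%b %d, %Y"` date format assumes an English locale for month abbreviations.

```r
# Hand-built sample mirroring the first rows of the str(df) output above
sample_df <- data.frame(
  quote  = c("Exceptional: 4.13 of 5", "Noteworthy: 3.75 of 5",
             "Remarkable: 5.00 of 5", "Exemplary: 4.25 of 5"),
  rating = c("\n        4.13\n    ", "\n        3.75\n    ",
             "\n        5.00\n    ", "\n        4.25\n    "),
  date   = c("Dec 30, 2015", "Dec 30, 2015", "Dec 30, 2015", "Dec 29, 2015"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: strip whitespace, convert ratings to numbers,
# and parse dates (month parsing assumes an English locale)
clean_reviews <- function(df) {
  df$rating <- as.numeric(trimws(df$rating))
  df$date   <- as.Date(df$date, format = "%b %d, %Y")
  df
}

cleaned <- clean_reviews(sample_df)
str(cleaned)
```

The same helper could be applied to the full scraped `df`. In a non-English locale the date column may come back as `NA`, in which case the locale would need to be set before parsing.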