I am learning web scraping and trying to scrape information from https://www.kununu.com/us/google1/reviews.
Here is my code:
rm(list = ls())
library(httr)
library(rvest)
library(xml2)
library(curl)
url <- "https://www.kununu.com/us/google1/reviews"
reviews <- url %>%
  read_html() %>%
  html_nodes(".panel-body")
quote <- reviews %>%
  html_nodes("h2 a") %>%
  html_text()
rating <- reviews %>%
  html_nodes(".tile-heading") %>%
  html_text()
date <- reviews %>%
  html_nodes("strong") %>%
  html_text()
a <- data.frame(quote, rating, date, stringsAsFactors = FALSE)
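However, the code above only scrapes the first ten entries. I found some suggestions online about using the RSelenium package for dynamic websites. Unfortunately, my computer throws an error when I run checkForServer() followed by startServer(). Is there a way to collect all 56 reviews in one go when the page has a LOAD MORE option at the bottom?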
Answer 0 (score: 0)
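If you hover over the Load More link, you will see that it simply appends an integer to the end of your URL. So you can just loop over the pages to get everything. First extract the number of reviews, then work out the number of pages from it, and then reuse your code to fetch the data from each page.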
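The page-count arithmetic described above can be sketched offline, without touching the site. This is only an illustration: the total of 56 comes from the question, the page size of 10 is an assumption based on the question reporting ten reviews per load, and the names n_reviews, per_page, and page_urls are made up for this sketch.

```r
# Offline sketch: no network access, values hard-coded for illustration.
base_url  <- "https://www.kununu.com/us/google1/reviews"
n_reviews <- 56   # total number of reviews, as stated in the question
per_page  <- 10   # assumed page size (the question sees ten reviews per load)

# Round up so that the final, partially filled page is included.
n_pages <- ceiling(n_reviews / per_page)

# Each page URL is the base URL with the page number appended.
page_urls <- paste0(base_url, "/", seq_len(n_pages))

print(n_pages)
print(page_urls)
```

Using ceiling() on the quotient gives the same page count as rounding the review count up to the next multiple of ten and then dividing, and it does not need an extra package.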
library(httr)
library(rvest)
library(xml2)
library(curl)
library(plyr)
url <- "https://www.kununu.com/us/google1/reviews"
num_of_reviews <- read_html(url) %>%
  html_nodes(".title-number") %>%
  .[[1]] %>%
  html_text()
# round up to nearest 10s
num_of_reviews_rounded <- num_of_reviews %>%
  as.numeric() %>%
  round_any(10, f = ceiling)
pages <- 1 : (num_of_reviews_rounded / 10)
get_reviews <- function(url){
  reviews <- url %>%
    read_html() %>%
    html_nodes(".panel-body")
  quote <- reviews %>%
    html_nodes("h2 a") %>%
    html_text()
  rating <- reviews %>%
    html_nodes(".tile-heading") %>%
    html_text()
  date <- reviews %>%
    html_nodes("strong") %>%
    html_text()
  a <- data.frame(quote, rating, date, stringsAsFactors = FALSE)
  return(a)
}
list_of_dfs <- lapply(pages, function(x) get_reviews(paste0(url, "/", x)))
df <- do.call(rbind, list_of_dfs)
> str(df)
'data.frame': 56 obs. of 3 variables:
$ quote : chr "Exceptional: 4.13 of 5" "Noteworthy: 3.75 of 5" "Remarkable: 5.00 of 5" "Exemplary: 4.25 of 5" ...
$ rating: chr "\n 4.13\n " "\n 3.75\n " "\n 5.00\n " "\n 4.25\n " ...
$ date : chr "Dec 30, 2015" "Dec 30, 2015" "Dec 30, 2015" "Dec 29, 2015" ...
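The rating column in the output above is still character data padded with newlines and spaces. One possible cleanup, shown here as a sketch on sample values copied from that output rather than on a live scrape, is to trim the whitespace and convert to numeric. The names raw_rating and rating_num are made up for this sketch.

```r
# Sample values copied from the str(df) output above.
raw_rating <- c("\n 4.13\n ", "\n 3.75\n ", "\n 5.00\n ", "\n 4.25\n ")

# trimws() strips leading and trailing spaces, tabs, and newlines;
# as.numeric() then parses the remaining text.
rating_num <- as.numeric(trimws(raw_rating))

print(rating_num)
```

On the real data frame the same idea would presumably be df$rating <- as.numeric(trimws(df$rating)), assuming every scraped rating has this shape.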