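I am trying to scrape ratings from several pages, but I am having trouble with the selector (I tried SelectorGadget, without success).
So far the scrape only works for a single page: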
library(rvest)
points <- read_html("https://www.winemag.com/buying-guide/lagar-de-bezana-2014-aluvion-ensamblaje-red-cachapoal-valley/")
points %>%
  html_node(".rating") %>%
  html_text()
[1] "93points"
For multiple pages, the result does not contain the real values (every page returns the same ratings):
library(rvest)
points <- lapply(
  paste0('https://www.winemag.com/?s=chile&search_type=all', 1:5),
  function(url) {
    url %>%
      read_html() %>%
      html_nodes(".rating") %>%
      html_text()
  }
)
points
[[1]]
[1] "93 Points" "92 Points" "92 Points" "92 Points" "92 Points" "92 Points"
[[2]]
[1] "93 Points" "92 Points" "92 Points" "92 Points" "92 Points" "92 Points"
[[3]]
[1] "93 Points" "92 Points" "92 Points" "92 Points" "92 Points" "92 Points"
[[4]]
[1] "93 Points" "92 Points" "92 Points" "92 Points" "92 Points" "92 Points"
[[5]]
[1] "93 Points" "92 Points" "92 Points" "92 Points" "92 Points" "92 Points"
Answer 0 (score: 0)
This solution seems to work. I changed the way the URLs are built:
library(rvest)
points <- lapply(
  paste0('https://www.winemag.com/?s=chile&drink_type=wine&page=', 1:5, '&search_type=all3', 1:5),
  function(url) {
    url %>%
      read_html() %>%
      html_nodes(".rating") %>%
      html_text()
  }
)
points
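To see what actually changed, it can help to compare the two sets of URLs without hitting the site at all. The sketch below rebuilds both vectors exactly as written in the question and in this answer, then checks which of them carries a page= parameter. get_param is a hypothetical helper written for this illustration (it is not part of rvest), and the conclusion that the site ignores the number appended to search_type is an assumption based on the identical output in the question, not something I verified.

```r
# URLs as built in the question: the counter is appended to the search_type value.
urls_question <- paste0("https://www.winemag.com/?s=chile&search_type=all", 1:5)

# URLs as built in this answer: the counter also goes into an explicit page= parameter.
urls_answer <- paste0("https://www.winemag.com/?s=chile&drink_type=wine&page=", 1:5,
                      "&search_type=all3", 1:5)

# Hypothetical helper (not from rvest): value of one query parameter, or NA if absent.
get_param <- function(url, name) {
  pattern <- paste0("^.*[?&]", name, "=([^&]*).*$")
  ifelse(grepl(pattern, url), sub(pattern, "\\1", url), NA_character_)
}

get_param(urls_question, "page")  # all NA: no page parameter is ever sent
get_param(urls_answer, "page")    # "1" "2" "3" "4" "5"
```

If the first call returns only NA, every request in the question was presumably asking for the same first page of results, which would explain the five identical vectors.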
Personally I would write it like this, although that is of course a matter of taste:
library(rvest)
library(dplyr)  # tibble(), rowwise(), mutate()
library(tidyr)  # unnest()
df <- tibble(url = paste0('https://www.winemag.com/?s=chile&drink_type=wine&page=', 1:5, '&search_type=all3', 1:5)) %>%
  rowwise() %>%
  mutate(
    rating = read_html(url) %>%
      html_nodes(".rating") %>%
      html_text() %>%
      list()
  ) %>%
  unnest(cols = c(rating))