R: Problem with selectors when scraping multiple web pages

Date: 2020-01-23 22:24:04

Tags: r, web-scraping

I am trying to scrape several web pages, but unfortunately I am running into problems with the selectors (I used SelectorGadget, without success).

I only managed to scrape a single page successfully:

library(rvest)
points <- read_html("https://www.winemag.com/buying-guide/lagar-de-bezana-2014-aluvion-ensamblaje-red-cachapoal-valley/")

points %>% 
  html_node(".rating") %>%
  html_text() 

[1] "93points"

For multiple pages, however, the results are not the real values; every page comes back with the same ratings:

library(rvest)

points <- lapply(paste0('https://www.winemag.com/?s=chile&search_type=all', 1:5),
                function(url){
                    url %>% read_html() %>% 
                        html_nodes(".rating") %>% 
                        html_text()
                })
points

[[1]]
[1] "93 Points" "92 Points" "92 Points" "92 Points" "92 Points" "92 Points"

[[2]]
[1] "93 Points" "92 Points" "92 Points" "92 Points" "92 Points" "92 Points"

[[3]]
[1] "93 Points" "92 Points" "92 Points" "92 Points" "92 Points" "92 Points"

[[4]]
[1] "93 Points" "92 Points" "92 Points" "92 Points" "92 Points" "92 Points"

[[5]]
[1] "93 Points" "92 Points" "92 Points" "92 Points" "92 Points" "92 Points"

1 answer:

Answer 0 (score: 0)

This solution seems to work. I have changed the way the URLs are created:

library(rvest)

points <- lapply(paste0('https://www.winemag.com/?s=chile&drink_type=wine&page=', 1:5, '&search_type=all3', 1:5),
                 function(url){
                   url %>% read_html() %>% 
                     html_nodes(".rating") %>% 
                     html_text()
                 })
points
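
The change that matters here is the &page= query parameter, which is what actually advances through the search results. The extra 1:5 appended after search_type=all3 only produces suffixes such as all31 and all32, which the site appears to ignore. A slightly cleaner way to build the same URLs (untested; it assumes search_type=all is the intended value) would be:

urls <- paste0('https://www.winemag.com/?s=chile&drink_type=wine&search_type=all&page=', 1:5)

The rest of the lapply() call can then stay exactly as above, with urls passed in place of the inline paste0().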

Personally I would write it like this, although that is of course a matter of personal preference:

library(rvest)
library(dplyr)    # tibble(), rowwise(), mutate()
library(tidyr)    # unnest()

df <- tibble(url = paste0('https://www.winemag.com/?s=chile&drink_type=wine&page=', 1:5, '&search_type=all3', 1:5)) %>%
  rowwise() %>%                        # process one URL (one row) at a time
  mutate(
    rating = read_html(url) %>%        # fetch and parse the page
      html_nodes(".rating") %>%        # select every rating element
      html_text() %>%
      list()                           # keep the vector of ratings as a list-column
  ) %>%
  unnest(cols = c(rating))             # expand to one row per rating
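
If you want the ratings as numbers rather than strings such as "93 Points", one extra step (not part of the original answer; it assumes the readr package is available) is parse_number(), which drops the " Points" suffix:

library(readr)

df <- df %>%
  mutate(points = parse_number(rating))   # "93 Points" -> 93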