Extracting data from multiple TripAdvisor results pages

Time: 2017-11-28 17:25:25

Tags: r web-scraping tripadvisor

I am trying to use rvest to scrape data from TripAdvisor search results that span multiple pages.

Here is my code:

library(rvest)

starturl <- 'https://www.tripadvisor.co.uk/Search?q=swim+with&uiOrigin=trip_search_Attractions&searchSessionId=CA54193AF19658CB1D983934FB5C86F41511875967385ssid#&ssrc=A&o=0'

swimwith <- read_html(starturl)

swdf <- swimwith %>%
  html_nodes('.title span') %>%
  html_text()

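It works for the first page of results, but I can't figure out how to get the results from subsequent pages. I noticed that the end of the URL indicates the starting position of the results, so I changed it from '0' to '30' as follows: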

url <- sub('A&o=0', paste0('A&o=', '30'), starturl)

webpage <- html_session(url)
swimwith <- read_html(webpage)

swdf2 <- swimwith %>%
  html_nodes('.title span') %>%
  html_text()

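However, the results in swdf2 are identical to those in swdf, even though the URL loads the second page of results in a web browser.

Any idea how to get the results from subsequent pages?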

1 Answer:

Answer 0: (score: 0)

I think you want something like this.

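library(rvest)

# Result offsets 0, 30, 60, ..., 300, one per results page
jump <- seq(0, 300, by = 30)
site <- paste0('https://www.tripadvisor.co.uk/Search?q=swim+with&uiOrigin=trip_search_Attractions&searchSessionId=CA54193AF19658CB1D983934FB5C86F41511875967385ssid#&ssrc=A&o=', jump)

# Read each page and return its titles as a character vector
dfList <- lapply(site, function(i) {
  swimwith <- read_html(i)

  swimwith %>%
    html_nodes('.title span') %>%
    html_text()
})

# Pages may return different numbers of titles, so concatenate them
finaldf <- data.frame(title = unlist(dfList), stringsAsFactors = FALSE)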
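
A note on combining the per-page results: if the pages return different numbers of titles, rbind recycles the shorter vectors to pad the rows, while unlist simply concatenates. The small sketch below uses made-up stand-in data (the `pages` list is hypothetical, not real scraped output) to show the difference in base R.

```r
# Hypothetical stand-in for the list lapply() would return:
# the first "page" has three titles, the second only two.
pages <- list(c("a", "b", "c"), c("d", "e"))

# rbind builds a 2 x 3 matrix and recycles the shorter vector
# (R emits a warning about the length mismatch, suppressed here).
combined_rbind <- suppressWarnings(do.call(rbind, pages))

# unlist concatenates the vectors without padding or recycling.
combined_flat <- unlist(pages)

print(dim(combined_rbind))
print(combined_flat)
```

With real scraped data, the recycled entries would show up as duplicated titles, so concatenating is probably the safer choice when all you need is one long column.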

I couldn't run it at my office because the firewall blocks it, but I think this should work for you.

Also, take a look at the links below.
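
One caveat that may explain why the second page comes back identical to the first: in this URL, the '&ssrc=A&o=' part sits after a '#'. Everything after '#' is the URL fragment, which an HTTP client does not send to the server, so the browser presumably uses it in JavaScript to load the later pages. The sketch below only manipulates strings (no request is made); `strip_fragment` is a hypothetical helper written for illustration, not an rvest function.

```r
starturl <- 'https://www.tripadvisor.co.uk/Search?q=swim+with&uiOrigin=trip_search_Attractions&searchSessionId=CA54193AF19658CB1D983934FB5C86F41511875967385ssid#&ssrc=A&o=0'

# Same substitution as in the question: offset 0 becomes offset 30
page2url <- sub('A&o=0', 'A&o=30', starturl, fixed = TRUE)

# Hypothetical helper: drop the '#' and everything after it,
# leaving the part of the URL that is actually sent to the server.
strip_fragment <- function(url) sub('#.*$', '', url)

requested1 <- strip_fragment(starturl)
requested2 <- strip_fragment(page2url)

# The two URLs differ only in the fragment, so the request is the same.
print(identical(requested1, requested2))
```

If that is what is happening here, changing the offset after the '#' will not change what read_html receives. It may be worth watching the browser's network tab while clicking to page 2 to see which request actually returns the next set of results.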

https://rpubs.com/ryanthomas/webscraping-with-rvest

loop across multiple urls in r with rvest