Question

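I am using R to scrape Amazon for product prices. The products are spread over 5 pages, so I need to request a different URL each time. This is the code I am using: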

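library(rvest)       # read_html, html_nodes, html_text, %>%
library(purrr)       # map, simplify
library(data.table)  # rbindlist

pages<-c(1,2,3,4,5)
## build the URLs of the 5 pages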
urls<-rbindlist(lapply(pages,function(x){
  url<-paste("https://www.amazon.co.uk/Best-Sellers-Health-Personal-Care-Weight-Loss-Supplements/zgbs/drugstore/2826476031#",x,sep="")
  data.frame(url)
}),fill=TRUE)


product.price<-rbindlist(apply(urls,1,function(url){
  locations <- url %>%
    map(read_html) %>%
    map(html_nodes, xpath = '//*[@id="zg_centerListWrapper"]/div/div[2]/div/div[2]/span[1]/span') %>%
    map(html_text) %>%
    simplify()
  data.frame(locations)
}),fill=TRUE)

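There are 100 products, 20 per page, but what I get is the first 20 repeated 5 times. That suggests only the first URL is actually being fetched. How can I access all of the pages?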

Thanks.

Answer 1

Here is my take:

library(rvest)

url <- 'https://www.amazon.co.uk/Best-Sellers-Health-Personal-Care-Weight-Loss-Supplements/zgbs/drugstore/2826476031#'

page <- read_html(url)

numPages <- page %>%
  html_node('.zg_pagination') %>%
  html_nodes('li') %>%
  length

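items <- character()
for (i in seq_len(numPages)) {
  # Build each page URL from the unchanged base `url`; reassigning `url`
  # here would accumulate the page numbers ("#1", "#12", "#123", ...).
  pageUrl <- paste0(url, i)
  page <- read_html(pageUrl)

  # Item names for the current page
  item <- page %>%
    html_nodes(xpath = '//*[@id="zg_centerListWrapper"]/div/div[2]/div/a/div[2]') %>%
    html_text(trim = TRUE)

  items <- append(items, item)
}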
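Before hitting the network, it can help to check which URLs a loop like this would actually request. The sketch below uses only base R and the base URL from the question; `num_pages` is hard-coded here as an assumption rather than read from the pagination bar. It also strips the part after `#`, since a URL fragment is handled by the browser and is not sent to the server. That would be consistent with the same 20 products coming back each time, although the exact path or query parameter Amazon uses for paging is not shown here and would need to be checked on the site.

```r
# Base URL copied from the question; the trailing "#" starts a URL fragment.
base_url <- "https://www.amazon.co.uk/Best-Sellers-Health-Personal-Care-Weight-Loss-Supplements/zgbs/drugstore/2826476031#"
num_pages <- 5  # assumption: five pages, as stated in the question

# paste0() is vectorised, so this yields one URL per page in a single call
# and never modifies base_url.
page_urls <- paste0(base_url, seq_len(num_pages))
print(page_urls)

# Drop everything from "#" onward to see what the server is actually asked for.
requested <- sub("#.*$", "", page_urls)
print(length(unique(requested)))  # number of distinct URLs after dropping the fragment
```

If the distinct count is 1, every request is for the same resource, and the page number has to travel in the path or query string instead of the fragment.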

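Main differences:

I went with a loop rather than a functional approach
I changed the xpath argument to get the item names; you can easily extend it to get prices, star ratings, etc.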
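As a rough illustration of extending this to prices, the sketch below parses a small inline HTML snippet instead of the live site. The markup, the class names `.item`, `.name` and `.price`, and the "GBP" price format are all hypothetical stand-ins; the real Amazon selectors differ and change over time. The idea it shows is to select each item container first and then look up fields inside it with `html_node()`, which returns one result per item, so a product without a price becomes `NA` instead of shifting the columns out of alignment.

```r
library(rvest)

# Hypothetical markup standing in for one best-seller page (not Amazon's real HTML).
html <- '
<div id="list">
  <div class="item"><div class="name">Product A</div><span class="price">GBP 9.99</span></div>
  <div class="item"><div class="name">Product B</div></div>
  <div class="item"><div class="name">Product C</div><span class="price">GBP 24.50</span></div>
</div>'

page <- read_html(html)

# One node per product container.
item_nodes <- html_nodes(page, ".item")

# html_node() (singular) yields exactly one result per container,
# so names and prices stay aligned even when a price is missing.
products <- data.frame(
  name  = item_nodes %>% html_node(".name")  %>% html_text(trim = TRUE),
  price = item_nodes %>% html_node(".price") %>% html_text(trim = TRUE),
  stringsAsFactors = FALSE
)

# Keep only digits and the decimal point to get a numeric column.
products$price_value <- as.numeric(gsub("[^0-9.]", "", products$price))

print(products)
```

Inside the page loop, the resulting data frames could be collected in a list and combined once at the end, for example with `do.call(rbind, ...)` or `data.table::rbindlist()`.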
