Question

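I want to scrape image URLs from the website "https://www.raworange.com/collections/all-clothing". There are 9 pages in total, so I want to scrape the images from all of the pages and also download the image URLs together with the image names. I tried this code: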

library(rvest)
url <- "https://www.raworange.com/collections/all-clothing"
imgsrc <- read_html(url) %>%
  html_node(xpath = '#bc-sf-filter-products img') %>%
  html_attr('src')
imgsrc
download.file(paste0(url, imgsrc), destfile = basename(imgsrc))

This doesn't work. Any help is appreciated.

Answer 1

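As discussed in the comments, this reads the page and builds a list of pairs: the product name, taken from the h2 under each item, and the list of img URLs in that product's preview: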

library(rvest)
library(purrr)

url <- "https://www.raworange.com/collections/all-clothing"
html <- read_html(url)
products <- html %>% html_nodes(css='div.product-preview')

products %>% map(function(product) {
  name <- product %>% html_nodes(css='h2.product_title') %>% html_text()
  imgs <- product %>% html_nodes(css='img') %>% html_attr('src')
  c(name, list(imgs))
})

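I would also suggest getting the URLs of the other pages from ul#bc-sf-filter-bottom-pagination, but it looks like that element is populated by a script at load time, so rvest can't easily scrape it. I think you will have to look at the pagination URLs in your browser and construct them yourself so that your code can loop over them.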
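A rough sketch of what that loop could look like is below. It assumes the pages are addressed with a `?page=N` query parameter, which I have not verified for this site, so check the real pattern in your browser first. The helper names (`image_file_name`, `scrape_page`, `download_product`, `scrape_all`) and the `images` folder are made up for illustration; the selectors are the same ones used above.

```r
library(rvest)
library(purrr)

base_url <- "https://www.raworange.com/collections/all-clothing"

# ASSUMPTION: pages 1..9 are reachable as "?page=N". Confirm in the browser.
page_urls <- paste0(base_url, "?page=", 1:9)

# Build a file name from the product name and the image URL.
# Non-alphanumeric characters in the name are replaced with "_",
# and any query string (e.g. "?v=123") is dropped from the URL.
image_file_name <- function(product_name, img_url) {
  safe_name <- gsub("[^A-Za-z0-9]+", "_", trimws(product_name))
  file_part <- basename(sub("\\?.*$", "", img_url))
  paste0(safe_name, "_", file_part)
}

# Scrape one page: returns a list of list(name = ..., imgs = ...) entries.
scrape_page <- function(page_url) {
  html <- read_html(page_url)
  products <- html %>% html_nodes(css = "div.product-preview")
  map(products, function(product) {
    name <- product %>%
      html_node(css = "h2.product_title") %>%
      html_text(trim = TRUE)
    imgs <- product %>% html_nodes(css = "img") %>% html_attr("src")
    # src values may be relative or protocol-relative ("//cdn..."),
    # so resolve them against the page URL.
    list(name = name, imgs = xml2::url_absolute(imgs, page_url))
  })
}

# Download every image of one product into dest_dir.
download_product <- function(product, dest_dir = "images") {
  dir.create(dest_dir, showWarnings = FALSE)
  walk(product$imgs, function(img_url) {
    dest <- file.path(dest_dir, image_file_name(product$name, img_url))
    download.file(img_url, destfile = dest, mode = "wb")
  })
}

# Scrape all pages and flatten the result to one list of products.
scrape_all <- function(urls = page_urls) {
  unlist(map(urls, scrape_page), recursive = FALSE)
}
```

With that in place, `all_products <- scrape_all()` followed by `walk(all_products, download_product)` would fetch everything. If the product grid itself turns out to be filled in by a script, `scrape_page` will come back empty and you would need a different approach.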
