Download the images in a list of websites

Date: 2017-10-26 14:08:40

Tags: r web-scraping

I have a data frame that looks like this:

urls <- data.frame(c("https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=1212/08", 
              "https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=1212/09", 
              "https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=1213/07", 
              "https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=1213/08", 
              "https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=1213/09", 
              "https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=1214/07", 
              "https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=1214/08", 
              "https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=1214/09"))

To download every image from each of these websites, I put together this code with the help of some people on Stack Overflow:

library(rvest)
library(dplyr)

for (url in urls) {

  webpage <- html_session(url)
  link.titles <- webpage %>% html_nodes("img")
  img.url <- link.titles %>% html_attr("src")
  download.file(img.url, url, ".jpg", mode = "wb")

}

However, it returns this error:

Error: is.character(url) is not TRUE

Strangely, running it without the loop works fine:

url <- "https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=1692/09"
webpage <- html_session(url)
link.titles <- webpage %>% html_nodes("img")
img.url <- link.titles %>% html_attr("src")
download.file(img.url, "test.jpg", mode = "wb")

I would like to download every image from each of the websites.

2 answers:

Answer 0 (score: 1)

I think it is reading the urls in your data frame as factors - you need to convert them with as.character() so that the loop iterates over character strings, like this:

for (url in as.character(urls[[1]])) {
  webpage <- html_session(url)
  ...
}
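To see the factor issue in isolation (a minimal sketch with placeholder URLs, not the ones from the question): before R 4.0, data.frame() converted character columns to factors by default, and html_session() insists on a character string, which is what triggers the is.character(url) error. Passing stringsAsFactors = FALSE is an alternative to wrapping the column in as.character():

```r
# Hypothetical two-row example; any URLs would do.
# Under pre-4.0 defaults this column becomes a factor:
urls_factor <- data.frame(url = c("https://example.com/a", "https://example.com/b"))

# Keeping the column as character avoids the error:
urls_chr <- data.frame(url = c("https://example.com/a", "https://example.com/b"),
                       stringsAsFactors = FALSE)
class(urls_chr$url)            # "character"
is.character(urls_chr$url[1])  # TRUE, so html_session() would accept it
```

Either fix works; as.character() patches the loop in place, while stringsAsFactors = FALSE fixes the data frame at the source.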

Answer 1 (score: 1)

This works, but it looks like every image is the same; not sure if that is the intention.

urls <- c("https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=1212/08", 
                     "https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=1212/09", 
                     "https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=1213/07", 
                     "https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=1213/08", 
                     "https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=1213/09", 
                     "https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=1214/07", 
                     "https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=1214/08", 
                     "https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=1214/09")

for (url in 1:length(urls)) {

    print(url)
    webpage <- html_session(urls[url])
    link.titles <- webpage %>% html_nodes("img")
    img.url <- link.titles %>% html_attr("src")
    download.file(img.url, paste0(url,".jpg"), mode = "wb")

}

I changed urls from a data frame to a character vector. If you want to keep it in a data frame, do the following:

for(i in 1:nrow(urls_df)){...}

and then reference it in the loop body like this:

webpage <- html_session(urls_df[i,1]) # Refers to the i'th row column 1

I also changed the arguments to download.file, which were different in your loop than in your single-URL solution.

To download all of the images:

for (url in 1:length(urls)) {

    print(url)
    webpage <- html_session(urls[url])
    link.titles <- webpage %>% html_nodes("img")
    img.url <- link.titles %>% html_attr("src")

    for(j in 1:length(img.url)){

        download.file(img.url[j], paste0(url,'.',j,".jpg"), mode = "wb")
    }

}

If you only want the images from the body of the page, look at the page structure; you can then add an if condition that only starts the download when length(img.url) > 1.
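That guard could be sketched like this (a variation on the loop above, not tested against the live site; the > 1 threshold is the answer's heuristic and would also skip any page whose only image is a real product photo):

```r
library(rvest)

for (url in 1:length(urls)) {

    webpage <- html_session(urls[url])
    img.url <- webpage %>% html_nodes("img") %>% html_attr("src")

    # Only download when the page carries more than one image,
    # i.e. more than just a single layout/banner image.
    if (length(img.url) > 1) {
        for (j in 1:length(img.url)) {
            download.file(img.url[j], paste0(url, '.', j, ".jpg"), mode = "wb")
        }
    }

}
```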