Downloading files from a data frame with multiple elements and rows containing URLs

Time: 2019-03-30 17:58:53

Tags: r url web-scraping element screen-scraping

I scraped a number of URLs from several websites and put them in one big list of 145 elements (one per scraped website). Each element holds 90-300 rows in a column called X[[i]]. What I want to do next is search the URLs in this list for the word "agenda" and use those URLs to download the documents.

My code so far:

## scrape urls
  library(rvest)   # read_html(), html_nodes(), html_attr()
  library(purrr)   # map_df(), map()
  library(tibble)  # tibble(); data_frame() is deprecated
  library(glue)    # glue(), used further below

  url_base <- "https://amsterdam.raadsinformatie.nl/sitemap/meetings/201%d"
  urls <- map_df(7:8, function(i) {
    page <- read_html(sprintf(url_base, i))
    tibble(urls = html_nodes(page, "a") %>% html_attr("href"))
  })
  # keep only links to committee meetings ("raadscomm")
  rcverg17_18 <- data.frame(urls[grep("raadscomm", urls$urls), ])
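
(The %d in url_base is filled in by sprintf(), so i = 7 and i = 8 give the 2017 and 2018 sitemaps:)

  sprintf("https://amsterdam.raadsinformatie.nl/sitemap/meetings/201%d", 7)
  #> [1] "https://amsterdam.raadsinformatie.nl/sitemap/meetings/2017"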

## clean data (drop everything after the first space in each URL)
  rcverg17_18v2 <- sub(" .*", "", rcverg17_18$urls)
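
For illustration, sub(" .*", "", x) keeps everything up to the first space; a quick check with a made-up value:

  sub(" .*", "", "/document/123 Termijnagenda Verkeer")
  #> [1] "/document/123"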

## scrape urls from websites
  # note: `list` shadows the base function list(); a more
  # descriptive name would be safer
  list <- map(rcverg17_18v2, function(url) {
    url <- glue("https://amsterdam.raadsinformatie.nl{url}")
    read_html(url) %>%
      html_nodes("a") %>%
      html_attr("href")
  })
  list2 <- lapply(list, as.data.frame)

This produces a big list that looks like:

list2

list2 list[145]                     List of length 145
[[1]] list[239 x 1] (S3: data.frame) A data.frame with 239 rows and 1 column
[[2]] list[139 x 1] (S3: data.frame) A data.frame with 139 rows and 1 column
[[3]] list[186 x 1] (S3: data.frame) A data.frame with 186 rows and 1 column
[[4]] list[170 x 1] (S3: data.frame) A data.frame with 170 rows and 1 column
...

A single element contains different kinds of entries, for example:

list2[[1]] 

X[[i]]
1 #zoeken                                                                                            
2 #agenda_container                                                                                                                                                                 
3 #media_wrapper
4 ...

There are also URLs with whitespace in them, for example:

104            https://amsterdam.raadsinformatie.nl/document/4851596/1/ID_17-01-11_Termijnagenda_Verkeer_en_Vervoer

I want to find the URLs that contain the word "agenda" in their name and then download the files behind those URLs. I know I have to use the download.file() function, but I don't know exactly how. I also don't know how to search for URLs in this kind of data frame (with elements). Can anyone help me complete the code?

Note that the whitespace in the cells still has to be removed before the files can be downloaded.
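
(As an illustration of that clean-up, assuming the spaces are padding around the URL rather than part of it, something like trimws() would do:)

  trimws("  https://amsterdam.raadsinformatie.nl/document/4851596/1/ID_17-01-11_Termijnagenda_Verkeer_en_Vervoer ")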

1 answer:

Answer 0 (score: 1)

We can achieve this with the following code:

# Run this code after you create list but before you create list2

library(data.table)  # provides the %like% operator used below

# wrapper around data.frame() that keeps strings as characters
# instead of converting them to factors
data.frame2 = function(x){
  data.frame(x, stringsAsFactors = F)
}

# New code for list2
list2 <- lapply(list, data.frame2)

# Convert the list to a single data frame
df = do.call(rbind.data.frame, list2)

# Obtain a vector of URLs which contain the word "agenda"
url.vec = df[df$x %like% "agenda", ]

# Remove elements of the vector which are the string "#agenda_container" (these are not URLs)
url.vec = url.vec[url.vec != "#agenda_container"]

# Keep only URLs which contain the string "document". These URLs let us fetch
# documents; the ones without "document" are web pages and cannot be fetched.
url.vec = url.vec[url.vec %like% "document"]

# Strip the whitespace around the URLs (see the note in the question)
url.vec = trimws(url.vec)

# Set the working directory
# setwd("~/PATH WHERE YOU WOULD LIKE TO SAVE THE FILES")

# Download the files in a loop
# we have to add the extension ".pdf"
# temp.name names each file with the last part of the URL, after the last slash ("/")

for(i in url.vec){
  temp.name = basename(i)
  # mode = "wb" keeps binary files such as PDFs intact on Windows
  download.file(i, destfile = paste0(temp.name, ".pdf"), mode = "wb")
}
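
If you prefer to stay in the purrr style used in the question, the same filtering can be sketched without data.table (same object names as above; a sketch, not tested against the live site):

library(purrr)    # flatten_chr()
library(stringr)  # str_subset()

# flatten the scraped hrefs into one character vector, keep entries that
# mention "agenda", drop page anchors such as "#agenda_container", keep
# only document links, and strip surrounding whitespace
url.vec <- list %>%
  flatten_chr() %>%
  str_subset("agenda") %>%
  str_subset("^#", negate = TRUE) %>%
  str_subset("document") %>%
  trimws()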

Check your folder: all the files should have been downloaded. This is the temporary folder I downloaded the files into:

[Screenshot: folder after downloading all documents]