I scraped a lot of URLs from several websites and put them in one big list, which has 145 elements (one for each scraped website). Each element has 90-300 rows in a column called X[[i]]. What I want to do next is search the URLs in the list for the word "agenda" and download the documents from those URLs.
My code so far is:
## scrape urls
library(rvest)
library(purrr)
library(dplyr)
library(glue)

url_base <- "https://amsterdam.raadsinformatie.nl/sitemap/meetings/201%d"
map_df(7:8, function(i){
  page <- read_html(sprintf(url_base, i))
  data_frame(urls = html_nodes(page, "a") %>% html_attr("href"))
}) -> urls
rcverg17_18 <- data.frame(urls[grep("raadscomm", urls$urls), ])
## clean data
rcverg17_18v2 <- sub(" .*", "", rcverg17_18$urls)
## scrape urls from websites
list <- map(rcverg17_18v2, function(url) {
  url <- glue("https://amsterdam.raadsinformatie.nl{url}")
  read_html(url) %>%
    html_nodes("a") %>%
    html_attr("href")
})
list2 <- lapply(list, as.data.frame)
This gives a large list that looks like this:
list2
list2 list[145] List of length 145
[[1]] list[239 x 1] (S3: dataframe) A data.frame with 239 rows and 1 column
[[2]] list[139 x 1] (S3: dataframe) A data.frame with 139 rows and 1 column
[[3]] list[185 x 1] (S3: dataframe) A data.frame with 185 rows and 1 column
[[4]] list[170 x 1] (S3: dataframe) A data.frame with 170 rows and 1 column
[[.]] ...
[[.]] ...
[[.]] ...
A single element contains different kinds of entries, for example:
list2[[1]]
X[[i]]
1 #zoeken
2 #agenda_container
3 #media_wrapper
4 ...
There are also URLs with spaces in them, for example:
104 https://amsterdam.raadsinformatie.nl/document/4851596/1/ID_17-01-11_Termijnagenda_Verkeer_en_Vervoer
I want to find the URLs whose names contain the word "agenda" and then download the files from those URLs. I know I have to use the download.file() function for the download, but I don't know exactly how. I also don't know how to search for URLs in this kind of data frame (a list of elements). Can anyone help me complete the code?
Note that the spaces in the cells still have to be removed before the files can be downloaded.
Answer (score: 1)
We can achieve this with the following code:
# Run this code after you create list but before you create list2
library(data.table)   # for the %like% operator used below

data.frame2 = function(x){
  data.frame(x, stringsAsFactors = F)
}

# New code for list2
list2 <- lapply(list, data.frame2)

# Convert the list to a single data frame
df = do.call(rbind.data.frame, list2)

# Obtain a vector of URLs which contain the word "agenda"
url.vec = df[df$x %like% "agenda", ]

# Remove elements of the vector that are the string "#agenda_container" (these are not URLs)
url.vec = url.vec[url.vec != "#agenda_container"]

# Keep only URLs which contain the string "document". These URLs let us fetch documents;
# the URLs which don't contain "document" are web pages and cannot be fetched.
url.vec = url.vec[url.vec %like% "document"]

# Set the working directory
# setwd("~/PATH WHERE YOU WOULD LIKE TO SAVE THE FILES")

# Download the files in a loop
# We have to add the extension ".pdf"
# temp.name names each file with the last part of the URL, after the last slash ("/")
for(i in url.vec){
  temp.name = tail(unlist(strsplit(i, "\\/")), 1)
  download.file(i, destfile = paste0(temp.name, ".pdf"))
}
Check your folder; all of the files should have been downloaded.
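One thing the answer does not cover is the asker's note about URLs that contain spaces. A minimal sketch of that clean-up, under the assumption that spaces are the only problematic characters in url.vec (the name url.vec.clean and the mode = "wb" argument are my additions, not part of the answer above):

# Percent-encode any literal spaces left in the URLs before downloading
url.vec.clean = gsub(" ", "%20", trimws(url.vec))

for(i in url.vec.clean){
  temp.name = tail(unlist(strsplit(i, "\\/")), 1)
  # mode = "wb" writes the PDF as a binary file (matters on Windows)
  download.file(i, destfile = paste0(temp.name, ".pdf"), mode = "wb")
}

If other special characters turn up in the URLs, base R's URLencode() should handle those as well.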