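I'm trying to write an R script to scrape data from tables on multiple pages of a site. To do this, I'd like to first create a list of the specific pages to scrape. The page addresses follow the format "www.urlpart1/[year]/urlpart2/[page]", where [year] ranges from 2003 to 2015 (13 elements) and [page] takes values from 1 to 281 in increments of 40 (8 elements), so the final list should have 104 elements. Here is my code: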
#specify components of URLs
url1 <- "www.urlpart1/"
url2 <- "/urlpart2/"
#specify range of years to scrape
years <- as.list(seq(from = 2003, to = 2015, by = 1)) #13 elements
#specify specific pages within each year to scrape
pages <- as.list(seq(from = 1, to = 281, by = 40)) #8 elements
#specify length of final list of URLs for scraping
loops <- as.list(seq(from = 1, to = (length(years)*length(pages)), by = 1)) #104 elements
#create empty list for storing output of for-loop
list1 <- list()
#initialize loop
for (i in loops){
  for (j in years){
    for (k in pages){
      list1[[i]] <- paste0(url1,j,url2,k)
    }
  }
}
list1 #outputs 104 elements of last iteration of loop
Ultimately, the list should contain 104 elements that look like this:
"www.urlpart1/2003/urlpart2/1",
"www.urlpart1/2003/urlpart2/41",
"www.urlpart1/2003/urlpart2/81",
"www.urlpart1/2003/urlpart2/121",
"www.urlpart1/2003/urlpart2/161",
"www.urlpart1/2003/urlpart2/201",
"www.urlpart1/2003/urlpart2/241",
"www.urlpart1/2003/urlpart2/281",
"www.urlpart1/2004/urlpart2/1",
"www.urlpart1/2004/urlpart2/41",
"www.urlpart1/2004/urlpart2/81",
"www.urlpart1/2004/urlpart2/121",
"www.urlpart1/2004/urlpart2/161",
"www.urlpart1/2004/urlpart2/201",
"www.urlpart1/2004/urlpart2/241",
"www.urlpart1/2004/urlpart2/281",
...
"www.urlpart1/2015/urlpart2/1",
"www.urlpart1/2015/urlpart2/41",
"www.urlpart1/2015/urlpart2/81",
"www.urlpart1/2015/urlpart2/121",
"www.urlpart1/2015/urlpart2/161",
"www.urlpart1/2015/urlpart2/201",
"www.urlpart1/2015/urlpart2/241",
"www.urlpart1/2015/urlpart2/281"
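Unfortunately, I get a list of the correct length, but every value comes from the last iteration of the loop. Previous threads on similar problems don't seem to cover writing to a list inside nested loops. I'm open to solutions that don't rely on for loops at all. I could easily do this through Excel's GUI, but I want to improve my coding skills and make the process more reproducible. Thanks!
Answer 0 (score: 1)
We can use expand.grid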
to create all combinations of the variables, which returns a data.frame. We then paste the columns together row by row with do.call(paste0, ...), which gives a character vector.
res <- do.call(paste0,expand.grid(url1, years, url2, pages))
length(res)
#[1] 104
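Note that expand.grid varies its first argument fastest, so in the result above the year changes fastest, which is not the order listed in the question. If the order matters, one option is sketched below. It assumes plain vectors for years and pages rather than lists, and the names grid and urls are only illustrative:

```r
url1 <- "www.urlpart1/"
url2 <- "/urlpart2/"
years <- 2003:2015                          # 13 elements
pages <- seq(from = 1, to = 281, by = 40)   # 8 elements

# expand.grid varies its first argument fastest, so put pages before years
# to get all pages of 2003 first, then all pages of 2004, and so on
grid <- expand.grid(page = pages, year = years)
urls <- paste0(url1, grid$year, url2, grid$page)

length(urls)
head(urls, 3)
```

Because paste0 is vectorized over the two columns, no explicit loop is needed here.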
If a for loop is needed, this may help:
v1 <- c()
for(i in seq_along(url1)){
  for(j in seq_along(years)){
    for(k in seq_along(url2)){
      for(m in seq_along(pages)){
        v1 <- c(v1, paste0(url1[i], years[[j]], url2[k], pages[[m]]))
      }
    }
  }
}
identical(sort(res), sort(v1))
#[1] TRUE
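As for why the loop in the question returns the same value everywhere: for each i, the two inner loops run to completion and overwrite list1[[i]] with every year/page combination, so each slot ends up holding the last one (2015, 281). A minimal sketch of a two-level loop that avoids this is below. It uses a running index and a preallocated result instead of a third outer loop. The names urls_loop, idx, yr and pg are hypothetical, and years and pages are assumed to be plain vectors:

```r
url1 <- "www.urlpart1/"
url2 <- "/urlpart2/"
years <- 2003:2015
pages <- seq(from = 1, to = 281, by = 40)

# preallocate the output instead of growing it on every iteration
urls_loop <- character(length(years) * length(pages))
idx <- 0  # running position in the output vector

for (yr in years) {
  for (pg in pages) {
    idx <- idx + 1                                # advance once per combination
    urls_loop[idx] <- paste0(url1, yr, url2, pg)  # write each slot exactly once
  }
}

length(urls_loop)
```

Each slot is written exactly once, and the result comes out in the year-by-year order shown in the question.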