I am trying to scrape data from a database that does not allow direct downloads. I have been able to scrape data for a single species, but I need to collect it for 159 species, which is why I want to build a loop that might help.
test <- data.frame(site = c("http://ontariofishes.ca/fish_detail.php?FID=1",
                            "http://ontariofishes.ca/fish_detail.php?FID=2"),
                   html.node = "td.DataText", stringsAsFactors = F)

library(rvest)

# an empty list, to fill with the scraped data
empty_list <- list()

for (i in 1:nrow(test)){
  datatext  <- test[i, 1]  # URL for this species
  datatext2 <- test[i, 2]  # CSS selector for the data cells
  # scrape it!
  empty_list[[i]] <- read_html(datatext) %>% html_nodes(datatext2) %>% html_text()
}

names(empty_list) <- test$site
empty <- as.data.frame(empty_list)
This is what I have tried so far. It only works for two species, as indicated by FID=1 and FID=2 in the URLs. There are 159 species, which is why I want a for loop over 1:159 that fills a data frame using this same code.
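For reference, a minimal sketch of the kind of loop being asked for, assuming the FID parameter simply runs from 1 to 159 and that the td.DataText selector is valid on every page:

library(rvest)

# build the 159 URLs by varying the FID query parameter (assumed to run 1:159)
urls <- paste0("http://ontariofishes.ca/fish_detail.php?FID=", 1:159)
results <- vector("list", length(urls))

for (i in seq_along(urls)) {
  results[[i]] <- read_html(urls[i]) %>%
    html_nodes("td.DataText") %>%
    html_text()
}
names(results) <- urls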
Answer (score: 1)
I was able to figure it out!
library(rvest)

url <- "http://www.ontariofishes.ca/fish_detail.php?FID=1"
webpage <- read_html(url)

Data.Label <- webpage %>%
  html_nodes("td.DataLabel") %>%
  html_text()
Label <- as.data.frame(t(Data.Label))
# Obtains the data labels in a data frame that is transposed;
# the labels are the same for every species page.

Data.Text <- lapply(paste0('http://ontariofishes.ca/fish_detail.php?FID=', 1:159),
                    function(url){
                      url %>% read_html() %>%
                        html_nodes("td.DataText") %>%
                        html_text()
                    })
# Creates a list of all the data text needed to populate the table,
# one element per species (FID = 1 to 159).

Eco.Table <- as.data.frame(Data.Text)
# Convert the list into a data frame (one column per species).
Eco.Table <- Eco.Table[-c(39:42), ]
# Remove irrelevant rows.
Eco.Table <- as.data.frame(t(Eco.Table))
# Transpose the data frame so each species becomes a row.
rownames(Eco.Table) <- NULL
colnames(Eco.Table) <- as.character(unlist(Label))
# Reset row names and add column labels.
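As a follow-up, when requesting 159 pages in one run it can be worth pausing between requests and guarding against an occasional failed download. A minimal sketch of that idea, assuming the same URLs and selector as above; safe_scrape is a hypothetical helper name and the Sys.sleep interval is arbitrary:

library(rvest)

safe_scrape <- function(url) {
  Sys.sleep(1)  # pause between requests to be polite to the server
  tryCatch(
    url %>% read_html() %>% html_nodes("td.DataText") %>% html_text(),
    error = function(e) character(0)  # return an empty vector if a page fails to load
  )
}

Data.Text <- lapply(paste0('http://ontariofishes.ca/fish_detail.php?FID=', 1:159),
                    safe_scrape)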