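I want to use a loop to scrape data from multiple web pages in R. I can scrape the data for a single page, but when I try to loop over several pages I get a frustrating error. I have spent hours tinkering with it, to no avail. Any help would be much appreciated!
This works: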
###########################
# GET COUNTRY DATA
###########################
library("rvest")
site <- paste("http://www.countryreports.org/country/","Norway",".htm", sep="")
site <- html(site)
stats<-
data.frame(names =site %>% html_nodes(xpath="//*/td[1]") %>% html_text() ,
facts =site %>% html_nodes(xpath="//*/td[2]") %>% html_text() ,
stringsAsFactors=FALSE)
stats$country <- "Norway"
stats$names <- gsub('[\r\n\t]', '', stats$names)
stats$facts <- gsub('[\r\n\t]', '', stats$facts)
View(stats)
However, when I try to write the same thing as a loop, I get an error:
###########################
# ATTEMPT IN A LOOP
###########################
country<-c("Norway","Sweden","Finland","France","Greece","Italy","Spain")
for(i in country){
site <- paste("http://www.countryreports.org/country/",country,".htm", sep="")
site <- html(site)
stats<-
data.frame(names =site %>% html_nodes(xpath="//*/td[1]") %>% html_text() ,
facts =site %>% html_nodes(xpath="//*/td[2]") %>% html_text() ,
stringsAsFactors=FALSE)
stats$country <- country
stats$names <- gsub('[\r\n\t]', '', stats$names)
stats$facts <- gsub('[\r\n\t]', '', stats$facts)
stats<-rbind(stats,stats)
stats<-stats[!duplicated(stats),]
}
The error:
Error: length(url) == 1 is not TRUE
In addition: Warning message:
In if (grepl("^http", x)) { :
the condition has length > 1 and only the first element will be used
Answer 0 (score: 5)
Final working code:
###########################
# THIS WORKS!!!!
###########################
country<-c("Norway","Sweden","Finland","France","Greece","Italy","Spain")
all <- NULL  # initialize before the loop; otherwise `all` refers to the base R function all()
for(i in country){
site <- paste("http://www.countryreports.org/country/",i,".htm", sep="")
site <- html(site)
stats<-
data.frame(names =site %>% html_nodes(xpath="//*/td[1]") %>% html_text() ,
facts =site %>% html_nodes(xpath="//*/td[2]") %>% html_text() ,
stringsAsFactors=FALSE)
stats$nm <- i
stats$names <- gsub('[\r\n\t]', '', stats$names)
stats$facts <- gsub('[\r\n\t]', '', stats$facts)
#stats<-stats[!duplicated(stats),]
all<-rbind(all,stats)
}
View(all)
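For anyone wondering what changed from the question: the original loop passed the whole `country` vector to `paste()` instead of the loop variable `i`, so it built seven URLs at once, and `html()` accepts only one. That is where `length(url) == 1 is not TRUE` comes from. Here is a minimal sketch of just that difference, with no network access; the names `base_url`, `urls_wrong` and `url_right` are illustrative only and not part of the code above.

```r
country <- c("Norway", "Sweden", "Finland")
base_url <- "http://www.countryreports.org/country/"

# paste() is vectorised: passing the whole vector yields one URL per country
urls_wrong <- paste(base_url, country, ".htm", sep = "")
length(urls_wrong)   # 3, which is what trips the length(url) == 1 check

# Inside the loop, use the loop variable so each pass builds exactly one URL
for (i in country) {
  url_right <- paste(base_url, i, ".htm", sep = "")
  stopifnot(length(url_right) == 1)
}
```

The same reasoning applies to `stats$country <- country` in the question: it should use `i` as well, so that each block of rows is labelled with a single country.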
Answer 1 (score: 1)
Initialize an empty data frame before the loop. I worked through this problem, and the following code works fine for me.
country<-c("Norway","Sweden","Finland","France","Greece","Italy","Spain")
df <- data.frame(names = character(0),facts = character(0),nm = character(0))
for(i in country){
site <- paste("http://www.countryreports.org/country/",i,".htm", sep="")
site <- html(site)
stats<-
data.frame(names =site %>% html_nodes(xpath="//*/td[1]") %>% html_text() ,
facts =site %>% html_nodes(xpath="//*/td[2]") %>% html_text() ,
stringsAsFactors=FALSE)
stats$nm <- i
stats$names <- gsub('[\r\n\t]', '', stats$names)
stats$facts <- gsub('[\r\n\t]', '', stats$facts)
#stats<-stats[!duplicated(stats),]
#all<-rbind(all,stats)
df <- rbind(df, stats)
#all <- merge(Output,stats)
}
View(df)
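As a possible alternative to growing `df` with `rbind()` on every pass, you could build one data frame per country and bind them all at the end. This is only a sketch: `fetch_stats()` is a hypothetical stand-in that returns made-up rows, so that the pattern can be run offline. In real use it would wrap the scraping code above (and, on newer rvest releases, `read_html()` in place of the deprecated `html()`).

```r
country <- c("Norway", "Sweden", "Finland")

# Hypothetical stand-in for the scraping step: returns a small data frame
# per country so the pattern can be tried without network access.
fetch_stats <- function(nm) {
  data.frame(names = c("Capital", "Population"),
             facts = c(paste("capital of", nm), paste("population of", nm)),
             nm = nm,
             stringsAsFactors = FALSE)
}

# Build one data frame per country, then bind them in a single call
pieces <- lapply(country, fetch_stats)
df <- do.call(rbind, pieces)

nrow(df)        # two rows per country
unique(df$nm)   # one label per country
```

With this approach there is nothing to initialize before the loop, because the result is assembled only once.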
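Answer 2 (score: 0)
This is what I did. It is not the best solution, but you will get an output. It is also only a workaround: in general I would not recommend writing table output to a file while a loop is running. Here you go. After stats has been generated inside the loop,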
output<-rbind(stats,i)
then write the table to a file,
write.table(output, file = "D:\\Documents\\HTML\\Test of loop.csv", row.names = FALSE, append = TRUE, sep = ",")
#then close the loop
}
Good luck.
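One thing to watch with the append approach: `write.table()` with `append = TRUE` and column names enabled warns and writes the header again on every pass. Here is a sketch of one way around that, writing the header only the first time. It assumes a temporary file instead of the `D:\\` path and uses a made-up `stats` data frame in place of the scraped one.

```r
country <- c("Norway", "Sweden")
out_file <- tempfile(fileext = ".csv")  # assumption: temp file, not the D:\ path

for (i in country) {
  # Hypothetical placeholder for the scraped `stats` data frame
  stats <- data.frame(names = "Capital", facts = paste("capital of", i),
                      nm = i, stringsAsFactors = FALSE)

  first_write <- !file.exists(out_file)
  write.table(stats, file = out_file, sep = ",", row.names = FALSE,
              col.names = first_write,   # header only on the first pass
              append = !first_write)     # later passes append below it
}

# Read the file back to check that it has one header and one row per country
result <- read.csv(out_file, stringsAsFactors = FALSE)
```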