Scraping links from pages nested within a website, from R

Date: 2015-05-15 15:21:46

Tags: r loops web-scraping

I would like to ask for help with creating the list of links from which I need to download a number of documents.

I am trying to download the data on the electoral districts of the Czech Republic, which are available at http://data.cuzk.cz/kontroly-dat-isui/00-volebni-okrsky/CSV-2014-10-01/. However, the tables are provided in a hierarchy (region, county, district) and there are roughly 2,000 of them, so downloading them by hand is hardly feasible.

I have already figured out how to collect the links on any single page of the hierarchy, but it would be perfect to find code that works across all the pages of a given level.

#"scrape" links for regions
url <- "http://data.cuzk.cz/kontroly-dat-isui/00-volebni-okrsky/CSV-2014-10-01/"
webpage<-  getURL(url,encoding="UTF-8")
PARSED <- htmlParse(webpage)
regions <- xpathSApply(PARSED, "//a", xmlValue)
links <-paste(url, regions, "/", sep="")
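As a side note, xpathSApply with xmlValue returns the visible link text; building the URLs from the href attribute is usually safer in case the text and the actual path segment differ (a minimal sketch, using the same packages as above):

hrefs <- xpathSApply(PARSED, "//a", xmlGetAttr, "href")  # the target paths themselves
links <- paste0(url, hrefs)                              # assumes the hrefs are relative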

#"scrape" links for the counties in the first region (but I need to download links also in all other regions)
url_county <- "http://data.cuzk.cz/kontroly-dat-isui/00-volebni-okrsky/CSV-2014-10-01/Jihocesky_kraj/"
webpage_county <-  getURL(url_county,encoding="UTF-8")
PARSED_county <- htmlParse(webpage_county)
county <- xpathSApply(PARSED_county, "//a", xmlValue)
links_counties <-paste(url, county, "/", sep="")

# and finally the links for the districts in the county
# (but I need the links in all the other counties of all the other regions as well)
url_district <- "http://data.cuzk.cz/kontroly-dat-isui/00-volebni-okrsky/CSV-2014-10-01/Jihocesky_kraj/Ceske_Budejovice/"
webpage_district <- getURL(url_district, encoding = "UTF-8")
PARSED_district  <- htmlParse(webpage_district)
district <- xpathSApply(PARSED_district, "//a", xmlValue)
links_districts <- paste(url_district, district, sep = "")  # base must be url_district, not url
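All three blocks above repeat the same fetch-parse-extract pattern, so it can be wrapped in a small function (a sketch; get_links is a hypothetical name, and the filter assumes an Apache-style directory listing):

# fetch one directory listing and return the absolute URLs it links to
get_links <- function(url) {
  page   <- getURL(url, encoding = "UTF-8")
  parsed <- htmlParse(page)
  # use the href attribute rather than the visible link text
  hrefs  <- xpathSApply(parsed, "//a", xmlGetAttr, "href")
  # drop the sorting ("?C=...") and parent-directory links that such
  # listings typically include (an assumption about this server)
  hrefs  <- hrefs[!grepl("^\\?|^/|^\\.\\.", hrefs)]
  paste0(url, hrefs)
}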

I tried to use a loop, but it does not work: the result is overwritten on every iteration, so only the links of the last region survive.

for (i in 1:length(links)) {
  webpage_county <- getURL(links[i], encoding = "UTF-8")
  PARSED_county  <- htmlParse(webpage_county)
  # overwritten here on every pass instead of being accumulated
  links_counties <- xpathSApply(PARSED_county, "//a", xmlValue)
}
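For what it is worth, one way to make the traversal accumulate results instead of overwriting them (a sketch built on the hypothetical get_links helper above; lapply visits every page of a level and unlist flattens the per-page results into a single vector):

links_regions   <- get_links(url)                             # level 1: regions
links_counties  <- unlist(lapply(links_regions,  get_links))  # level 2: all counties
links_districts <- unlist(lapply(links_counties, get_links))  # level 3: all district files

# download every file into the working directory
for (link in links_districts) {
  download.file(link, destfile = basename(link))
}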

Does anyone have a suggestion for how to solve this?

0 Answers