I am trying to scrape and download csv files from a web page that lists a large number of them.
Code:
# Libraries
library(rvest)
library(httr)
# URL
url <- "http://data.gdeltproject.org/events/index.html"
# The csv's I want are from 14 through 378 (2018 year)
selector_nodes <- seq(from = 14, to = 378, by = 1)
# HTML read / rvest action
link <- url %>%
  read_html() %>%
  html_nodes(paste0("body > ul > li:nth-child(", selector_nodes, ") > a")) %>%
  html_attr("href")
I get this error:
Error in xpath_search(x$node, x$doc, xpath = xpath, nsMap = ns, num_results = Inf) :
Expecting a single string value: [type=character; extent=365].
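How do I correctly specify that I want nodes 14 through 378?
Once that is assigned, I will run a quick for loop and download all of the 2018 csv files.
Answer 0 (score: 0)
The error occurs because html_nodes() expects a single selector string, while paste0() applied to a 365-element vector returns 365 strings. It is simpler to select every link on the page and filter the result afterwards. See the comments in the code for a step-by-step solution.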
library(rvest)
# URL
url <- "http://data.gdeltproject.org/events/index.html"
# Read the page in once then attempt to process it.
page <- url %>% read_html()
# Extract the file list (the href of every link inside the list)
filelist <- page %>% html_nodes("ul li a") %>% html_attr("href")
# Keep only the files from 2018
filelist <- filelist[grep("2018", filelist)]
# A loop would go here to download all of the files
# Pause between downloads, then download a file (only the first one is shown)
Sys.sleep(1)
download.file(paste0("http://data.gdeltproject.org/events/", filelist[1]), filelist[1])
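The loop mentioned in the comments could look roughly like the sketch below. It is only one way to do it, not a definitive implementation. The helper name `download_all` and its arguments `dest_dir` and `pause` are hypothetical and not part of the original answer. It assumes `filelist` holds the relative file names extracted above and that the files are binary archives, hence `mode = "wb"`.

```r
# Hypothetical helper: download every file in `files` from `base_url`
# into `dest_dir`, pausing `pause` seconds after each request.
download_all <- function(files, base_url, dest_dir = "gdelt_2018", pause = 1) {
  # Create the target directory if it does not exist yet
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)

  for (f in files) {
    dest <- file.path(dest_dir, f)

    # Skip files already on disk so an interrupted run can be resumed
    if (file.exists(dest)) next

    # A failed download should not abort the whole loop
    tryCatch(
      download.file(paste0(base_url, f), dest, mode = "wb"),
      error = function(e) {
        message("Failed: ", f, " (", conditionMessage(e), ")")
        # Remove any partial file so that a later run retries it
        if (file.exists(dest)) unlink(dest)
      }
    )

    # Be polite to the server between requests
    Sys.sleep(pause)
  }

  # Return the destination paths invisibly
  invisible(file.path(dest_dir, files))
}
```

With the variables from the answer, it would be called as `download_all(filelist, "http://data.gdeltproject.org/events/")`. Skipping files that already exist means the call can simply be repeated after a network failure.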