Question

I am trying to scrape and download CSV files from a web page that lists a large number of CSVs.

Code:

# Libraries
library(rvest)
library(httr)

# URL
url <- "http://data.gdeltproject.org/events/index.html"

# The csv's I want are from 14 through 378 (2018 year)
selector_nodes <- seq(from = 14, to = 378, by = 1)

# HTML read / rvest action
link <- url %>% 
  read_html() %>% 
  html_nodes(paste0("body > ul > li:nth-child(", selector_nodes, ") > a")) %>% 
  html_attr("href")

I get this error:

 Error in xpath_search(x$node, x$doc, xpath = xpath, nsMap = ns, num_results = Inf) : 
   Expecting a single string value: [type=character; extent=365].

How do I correctly specify that I want nodes 14 through 378?

Once that is assigned, I plan to run a quick for loop and download all of the 2018 CSVs.

Answer 1

See the comments in the code for a step-by-step solution.
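For context on the error in the question: paste0() is vectorized, so pasting a 365-element sequence into the selector produces 365 separate strings, while html_nodes() expects a single selector string. The short sketch below only illustrates that mismatch; it assumes the same node range as the question, and the names selectors and selector_group are introduced here for illustration.

```r
# Node positions from the question (14 through 378)
selector_nodes <- seq(from = 14, to = 378, by = 1)

# paste0() is vectorized, so this yields one selector string per node
selectors <- paste0("body > ul > li:nth-child(", selector_nodes, ") > a")
length(selectors)

# Collapsing with "," gives a single string (a CSS selector group)
selector_group <- paste(selectors, collapse = ", ")
length(selector_group)
```

A collapsed string like selector_group should be usable as one selector, but selecting by position is fragile if the page layout changes, which is why the code that follows filters on the file names instead.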

library(rvest)

# URL
url <- "http://data.gdeltproject.org/events/index.html"

# Read the page once, then process it.
page <- url %>% read_html()

# Extract the file list (the href of every link in the list)
filelist <- page %>% html_nodes("ul li a") %>% html_attr("href")

# Keep only the files whose names contain "2018"
filelist <- filelist[grep("2018", filelist)]

# A loop would go here to download all of the files.
# Pause between downloads, then download a file.
# mode = "wb" keeps binary (zipped) files intact on Windows.
Sys.sleep(1)
download.file(paste0("http://data.gdeltproject.org/events/", filelist[1]), filelist[1], mode = "wb")
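The loop mentioned in the comments could look roughly like the following. This is a minimal sketch rather than a tested downloader: download_all, dest_dir and pause are hypothetical names introduced here, and it assumes the entries in filelist are file names relative to the events/ directory, as in the single download above.

```r
base_url <- "http://data.gdeltproject.org/events/"

# Hypothetical helper: download every file in `files` into `dest_dir`,
# skipping files already on disk and pausing after each request.
download_all <- function(files, dest_dir = "gdelt_2018", pause = 1) {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  for (f in files) {
    dest <- file.path(dest_dir, basename(f))
    if (file.exists(dest)) next  # already downloaded, no request made
    tryCatch(
      # mode = "wb" keeps binary (zipped) files intact on Windows
      download.file(paste0(base_url, f), dest, mode = "wb"),
      error = function(e) {
        message("Failed to download ", f, ": ", conditionMessage(e))
      }
    )
    Sys.sleep(pause)  # be polite to the server between requests
  }
  # Return the intended local paths invisibly
  invisible(file.path(dest_dir, basename(files)))
}

# Usage, with `filelist` as built above:
# download_all(filelist)
```

Skipping existing files means the loop can be rerun after an interruption without fetching everything again, and the tryCatch() keeps one failed download from stopping the rest.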

R: Scraping many zipped CSVs and downloading them to a local machine
