Question

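When a connection is created with open="r", it can be read line by line, which is useful for batch processing large data streams. For example, this script parses a sizable gzipped JSON HTTP stream by reading 100 lines at a time. Unfortunately, R does not support SSL: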

> readLines(url("https://api.github.com/repos/jeroenooms/opencpu"))
Error in readLines(url("https://api.github.com/repos/jeroenooms/opencpu")) : 
  cannot open the connection: unsupported URL scheme

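The RCurl and httr packages do support HTTPS, but I don't think they can create a connection object similar to url(). Is there some other way to read an HTTPS connection line by line, similar to the example in the script above?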

Answer 1

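Yes, RCurl can "read line by line". In fact, it always does so, but the higher-level functions hide this for convenience. You can use the writefunction option (and headerfunction for the headers) to specify a function that is called each time libcurl has received enough bytes from the body of the result. That function can do anything it wants. There are several examples of this in the RCurl package. But here is a simple one: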

curlPerform(url = "http://www.omegahat.org/index.html", 
            writefunction = function(txt, ...) { 
                                 cat("*", txt, "\n")
                                 TRUE
                            })
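Because the callback is handed whatever chunk of bytes libcurl has just received, a chunk may end in the middle of a line. The following is a minimal sketch, in base R only, of how such chunks could be reassembled into complete lines and delivered in batches. The names make_line_batcher, on_lines, write and finish are hypothetical helpers for illustration and are not part of RCurl. The sketch assumes each chunk arrives as a single string and that no multi-byte character is split across chunks.

```r
# Hypothetical helper (not part of RCurl): collects text chunks, splits them
# into complete lines, and passes the lines to `on_lines` in batches of `n`.
make_line_batcher <- function(on_lines, n = 100L) {
  partial <- ""           # trailing text whose newline has not arrived yet
  pending <- character()  # complete lines waiting to fill a batch

  flush <- function(all = FALSE) {
    while (length(pending) >= n || (all && length(pending) > 0L)) {
      k <- min(n, length(pending))
      on_lines(pending[seq_len(k)])
      pending <<- pending[-seq_len(k)]
    }
  }

  list(
    # Same shape as the callback in the answer: takes a chunk, returns TRUE.
    write = function(txt, ...) {
      buf <- paste0(partial, txt)
      pieces <- strsplit(buf, "\n", fixed = TRUE)[[1]]
      if (endsWith(buf, "\n") || length(pieces) == 0L) {
        partial <<- ""
      } else {
        # The last piece is an unfinished line; keep it for the next chunk.
        partial <<- pieces[length(pieces)]
        pieces <- pieces[-length(pieces)]
      }
      pending <<- c(pending, sub("\r$", "", pieces))
      flush()
      TRUE
    },
    # Call once after the transfer to deliver any remaining lines.
    finish = function() {
      if (nzchar(partial)) {
        pending <<- c(pending, partial)
        partial <<- ""
      }
      flush(all = TRUE)
      invisible(NULL)
    }
  )
}

# Simulated transfer: three chunks that cut one JSON record in half.
seen <- new.env()
seen$batches <- list()
batcher <- make_line_batcher(
  function(lines) seen$batches <- c(seen$batches, list(lines)),
  n = 2L
)
for (chunk in c("{\"a\":1}\n{\"a\"", ":2}\n{\"a\":3}\n", "{\"a\":4}")) {
  batcher$write(chunk)
}
batcher$finish()

# Untested assumption: with RCurl installed, the callback could be plugged in
# the same way as in the answer above.
# library(RCurl)
# curlPerform(url = "http://www.omegahat.org/index.html",
#             writefunction = batcher$write)
# batcher$finish()
```

Keeping the buffering in a closure means the per-batch work (for example, parsing JSON records) stays in on_lines, separate from the transfer itself.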

Answer 2

One solution is to call the curl executable manually via pipe. The following seems to work.

library(jsonlite)

# Stream the gzipped file through the curl executable and decompress it on the fly.
gzstream <- gzcon(pipe("curl https://jeroenooms.github.io/files/hourly_14.json.gz", open="r"))
batches <- list(); i <- 1

# Read 100 lines at a time until the stream is exhausted.
while(length(records <- readLines(gzstream, n = 100))){
  message("Batch ", i, ": found ", length(records), " lines of json...")
  # Wrap the newline-delimited records in a JSON array so fromJSON can parse them.
  json <- paste0("[", paste0(records, collapse=","), "]")
  batches[[i]] <- fromJSON(json, validate=TRUE)
  i <- i+1
}
weather <- rbind.pages(batches)
rm(batches); close(gzstream)

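However, this is suboptimal because the curl executable might be unavailable for various reasons. It would be much nicer to invoke this pipe directly through RCurl/libcurl.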
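Until a pure RCurl/libcurl route is available, one way to soften the availability problem is to check for the executable before opening the pipe. This is only a sketch: has_executable and open_curl_gzstream are hypothetical helper names, and the check relies on base R's Sys.which, which returns an empty string when a command is not found on the PATH.

```r
# Hypothetical helper: TRUE if an executable called `cmd` is found on the PATH.
has_executable <- function(cmd) {
  nzchar(unname(Sys.which(cmd)))
}

# Hypothetical wrapper around the pipe() call from the answer: it fails early
# with a clear message when curl is missing.
open_curl_gzstream <- function(url) {
  if (!has_executable("curl")) {
    stop("the 'curl' executable was not found on the PATH")
  }
  # -s silences curl's progress meter; shQuote() protects the URL from the shell.
  gzcon(pipe(paste("curl -s", shQuote(url)), open = "r"))
}
```

The returned connection could then be read with readLines(gzstream, n = 100) in the same loop as above.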

Reading an HTTPS connection line by line in R
