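I originally asked this question about performing this task with the httr package, but I don't think it is possible with httr. So I rewrote my code to use RCurl instead, but I am still tripping over something that is probably related to writefunction, and I really don't understand why.
To reproduce my problem, use the 32-bit version of R, where reading anything this large into RAM hits the memory limit. I need a solution that downloads directly to the hard disk.
To start, this code works correctly: the zipped file is saved to disk as expected.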
library(RCurl)
filename <- tempfile()
f <- CFILE(filename, "wb")
url <- "http://www2.census.gov/acs2011_5yr/pums/csv_pus.zip"
curlPerform(url = url, writedata = f@ref)
close(f)
# 2.1 GB file successfully written to disk
Now here is some RCurl code that does not work. As stated in the previous question, reproducing this exactly requires creating an extract on ipums.
your.email <- "email@address.com"
your.password <- "password"
extract.path <- "https://usa.ipums.org/usa-action/downloads/extract_files/some_file.csv.gz"
library(RCurl)
values <-
list(
"login[email]" = your.email ,
"login[password]" = your.password ,
"login[is_for_login]" = 1
)
curl = getCurlHandle()
curlSetOpt(
cookiejar = 'cookies.txt',
followlocation = TRUE,
autoreferer = TRUE,
ssl.verifypeer = FALSE,
curl = curl
)
params <-
list(
"login[email]" = your.email ,
"login[password]" = your.password ,
"login[is_for_login]" = 1
)
html <- postForm("https://usa.ipums.org/usa-action/users/validate_login", .params = params, curl = curl)
dl <- getURL( "https://usa.ipums.org/usa-action/extract_requests/download" , curl = curl)
Now that I am logged in, I try the same commands as above, but pass the curl object so that the cookies are kept.
filename <- tempfile()
f <- CFILE(filename, mode = "wb")
# this is the line that breaks:
curlPerform(url = extract.path, writedata = f@ref, curl = curl)
close(f)
# the error is:
Error in curlPerform(url = extract.path, writedata = f@ref, curl = curl) :
embedded nul in string: [[binary gibberish here]]
The answer to my previous post referred me to this c-level writefunction answer, but I have no idea how to re-create that curl_writer C program (on Windows?).
dyn.load("curl_writer.so")
writer <- getNativeSymbolInfo("writer", PACKAGE="curl_writer")$address
curlPerform(URL=url, writefunction=writer)
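..or why it is even necessary, since the five lines of code at the top of this question work without anything as crazy as getNativeSymbolInfo. I just don't understand why passing in the extra curl object, which stores the authentication/cookies and tells it not to verify SSL, would cause code that otherwise works to break.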
Answer 0 (score: 2)
Create a file named curl_writer.c from this link and save it to C:\<folder where you save your R files>
#include <stdio.h>
/**
* Original code just sent some message to stderr
*/
size_t writer(void *buffer, size_t size, size_t nmemb, void *stream) {
fwrite(buffer,size,nmemb,(FILE *)stream);
return size * nmemb;
}
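A hedged aside on the callback above: libcurl treats the return value of a write callback as the number of bytes actually handled, and it aborts the transfer when that number differs from size * nmemb. The version above always reports success, even if fwrite fails (for example on a full disk). The sketch below is one possible variant that reports what fwrite really wrote. The names checked_writer and roundtrip_size are invented here for illustration; they are not part of RCurl or libcurl. The helper only exercises the callback locally, without any network access, to show that bytes containing NUL pass through untouched.

```c
#include <stdio.h>

/* Same signature as the writer above (the shape libcurl expects for a
 * write callback). Assumption: reporting the real byte count is preferable
 * to always returning size * nmemb, so a failed write stops the transfer. */
size_t checked_writer(void *buffer, size_t size, size_t nmemb, void *stream) {
    /* fwrite returns the number of complete items written */
    size_t items = fwrite(buffer, size, nmemb, (FILE *)stream);
    return items * size;
}

/* Hypothetical local check: push len bytes through the callback into the
 * file at path, then return the resulting file size, or -1 on any failure.
 * The temporary file is deleted before returning. */
long roundtrip_size(const char *path, const void *data, size_t len) {
    FILE *fp = fopen(path, "wb");
    if (fp == NULL) {
        return -1;
    }
    size_t reported = checked_writer((void *)data, 1, len, fp);
    if (fclose(fp) != 0 || reported != len) {
        remove(path);
        return -1;
    }

    /* Reopen in binary mode and measure the size on disk */
    fp = fopen(path, "rb");
    if (fp == NULL) {
        remove(path);
        return -1;
    }
    long size = -1;
    if (fseek(fp, 0L, SEEK_END) == 0) {
        size = ftell(fp);
    }
    fclose(fp);
    remove(path);
    return size;
}
```

If this variant were used from R, the symbol looked up with getNativeSymbolInfo would be "checked_writer" instead of "writer"; the rest of the steps would stay the same.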
Open a command window, go to the folder where you saved curl_writer.c, and run the R compiler:
c:> cd "C:\<folder where you save your R files>"
c:> R CMD SHLIB -o curl_writer.dll curl_writer.c
Open R and run the script:
C:> R
your.email <- "email@address.com"
your.password <- "password"
extract.path <- "https://usa.ipums.org/usa-action/downloads/extract_files/some_file.csv.gz"
library(RCurl)
values <-
list(
"login[email]" = your.email ,
"login[password]" = your.password ,
"login[is_for_login]" = 1
)
curl = getCurlHandle()
curlSetOpt(
cookiejar = 'cookies.txt',
followlocation = TRUE,
autoreferer = TRUE,
ssl.verifypeer = FALSE,
curl = curl
)
params <-
list(
"login[email]" = your.email ,
"login[password]" = your.password ,
"login[is_for_login]" = 1
)
html <- postForm("https://usa.ipums.org/usa-action/users/validate_login", .params = params, curl = curl)
dl <- getURL( "https://usa.ipums.org/usa-action/extract_requests/download" , curl = curl)
# Load the DLL you created
# "writer" is the name of the function
# "curl_writer" is the name of the dll
dyn.load("curl_writer.dll")
writer <- getNativeSymbolInfo("writer", PACKAGE="curl_writer")$address
# Note that the "URL" parameter is upper case here; in your code it is lower case.
# I'm not sure whether that makes a difference.
# "writer" is the symbol defined above; extract.path is the file to download
f <- CFILE(filename <- tempfile(), "wb")
curlPerform(URL = extract.path, writedata = f@ref, writefunction = writer, curl = curl)
close(f)
Answer 1 (score: 1)
This is now possible with the httr package. Thanks, Hadley!