以编程方式在R

时间:2016-11-08 23:41:21

标签: r curl httr

我正在尝试使用R及其基于curl的webscraping库访问下面屏幕截图中突出显示的响应标题:位置文本。通过访问http://www.worldvaluessurvey.org/WVSDocumentationWVL.jsp,点击任何数据文件的下载,并填写协议表,我们可以轻松地在任何Web浏览器中找到这一点。下载将在Web浏览器中自动开始。

enter image description here

我认为获取有效Cookie的唯一方法是使用library(curlconverter)(请参阅How to download a file behind a semi-broken javascript asp function with R)但该答案似乎不足以以编程方式确定 http url 该文件的文件,仅在已知压缩文件时下载。

我已经使用不同的httr和curlconverter代码粘贴了下面的代码,但我在这里遗漏了一些内容。同样,唯一的目标是以编程方式完全确定R(跨平台)内的突出显示文本。

library(curlconverter)
library(httr)

browserPOST <-
    "curl 'http://www.worldvaluessurvey.org/AJDownload.jsp'
    -H 'Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
    -H 'Accept-Encoding:gzip, deflate'
    -H 'Accept-Language:en-US,en;q=0.8'
    -H 'Cache-Control:max-age=0'
    --compressed -H 'Connection:keep-alive'
    -H 'Content-Length:188'
    -H 'Content-Type:application/x-www-form-urlencoded'
    -H 'Cookie:ASPSESSIONIDCASQAACD=IBLGBFOAEHFILMMJJCFEOEMI; JSESSIONID=50DABDEDD0B2FC370C415B4BD1855260; __atuvc=13%7C45; __atuvs=58224f37d312c42400c'
    -H 'Host:www.worldvaluessurvey.org'
    -H 'Origin:http://www.worldvaluessurvey.org'
    -H 'Referer:http://www.worldvaluessurvey.org/AJDownloadLicense.jsp'
    -H 'Upgrade-Insecure-Requests:1'
    -H 'User-Agent:Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'"

form_data <-
    list( 
        ulthost = "WVS" ,
        CMSID = "" ,
        LITITLE = "" ,
        LINOMBRE = "fas" ,
        LIEMPRESA = "asf" ,
        LIEMAIL = "asdf" ,
        LIPROJECT = "asfd" ,
        LIUSE = "1" ,
        LIPURPOSE = "asdf" ,
        LIAGREE = "1" ,
        DOID = "3996" ,
        CndWAVE = "-1" ,
        SAID = "-1" ,
        AJArchive = "WVS Data Archive" ,
        EdFunction = "" ,
        DOP = "" 
    )   



getDATA <- (straighten(browserPOST) %>% make_req)[[1]]()

a <- VERB(verb = "POST", url = "http://www.worldvaluessurvey.org/AJDownload.jsp", 
    httr::add_headers(Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8", 
        `Accept-Encoding` = "gzip, deflate", `Accept-Language` = "en-US,en;q=0.8", 
        `Cache-Control` = "max-age=0", Connection = "keep-alive", 
        `Content-Length` = "188", Host = "www.worldvaluessurvey.org", 
        Origin = "http://www.worldvaluessurvey.org", Referer = "http://www.worldvaluessurvey.org/AJDownloadLicense.jsp", 
        `Upgrade-Insecure-Requests` = "1", `User-Agent` = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36"), 
    httr::set_cookies(`Cookie:ASPSESSIONIDCASQAACD` = "IBLGBFOAEHFILMMJJCFEOEMI", 
        JSESSIONID = "50DABDEDD0B2FC370C415B4BD1855260", `__atuvc` = "13%7C45", 
        `__atuvs` = "58224f37d312c42400c"), encode = "form",body=form_data)

2 个答案:

答案 0 :(得分:5)

这是一个很好的挑战!

问题与R语言无关。如果我们只是尝试将一些数据发布到下载脚本,我们将在任何语言中获得相同的结果。我们必须在这里处理某种安全“模式”。该站点限制用户检索文件URL,并要求用户填写带有数据的表单以提供这些链接。如果浏览器可以检索这些链接,那么我们也可以编写正确的HTTP调用。事实上,我们需要确切地知道我们必须做出哪些电话。为了找到这个,我们需要看到每当有人点击下载时网站所做的各个调用。以下是我在成功302 AJDownload.jsp POST电话之前发现的一些电话:

Http requests

我们可以清楚地看到它,如果我们查看AJDocumentation.jsp来源,它会使用jQuery $.get进行这些调用:

$.get("http://ipinfo.io?token=xxxxxxxxxxxxxx", function (response) {
    var geodatos=encodeURIComponent(response.ip+"\t"+response.country+"\t"+response.postal+"\t"+
    response.loc+"\t"+response.region+"\t"+response.city+"\t"+
    response.org);

    $.get("jdsStatJD.jsp?ID="+geodatos+
        "&url=http%3A%2F%2Fwww.worldvaluessurvey.org%2FAJDocumentation.jsp&referer=null&cms=Documentation",
        function (resp2) {
    });
}, "jsonp");

然后,在下面的几个电话中,我们可以看到成功的POST /AJDownload.jsp状态302 Moved Temporarily以及响应标头中想要的Location

Http requests

HTTP/1.1 302 Moved Temporarily
Content-Length: 0
Content-Type: text/html
Location: http://www.worldvaluessurvey.org/wvsdc/CO00001/F00003724-WVS_Longitudinal_1981-2014_stata_dta_v_2015_04_18.zip
Server: Microsoft-IIS/7.5
X-Powered-By: ASP.NET
Date: Thu, 01 Dec 2016 16:24:37 GMT

所以,这是该网站的安全机制。在用户即将通过点击链接启动下载之前,它使用ipinfo.io来存储有关其IP,位置甚至ISP组织的访问者信息。接收这些数据的脚本是/jdsStatJD.jsp。我没有使用ipinfo.io,也没有使用它们的API密钥(将它隐藏在我的屏幕截图上)而是创建了一个虚拟的有效数据序列,只是为了验证请求。根本不需要“受保护”文件的发布表单数据。可以在不发布这些数据的情况下下载文件。

此外,不需要curlconverter库。我们所要做的就是使用GET库进行简单的POSThttr请求。我想指出的一个重要部分是,为了防止httr POST函数在我们上次调用时跟随Location状态收到的302标题,我们需要使用配置设置config(followlocation = FALSE)当然会阻止它跟随Location,让我们从标题中提取Location

<强>输出

我的R脚本可以从命令行运行,它可以接受DOID参数的数值来获取所需的文件。例如,如果我们想要获取文件WVS_Longitudinal_1981-2014_stata_dta_v_2015_04_18的链接,那么我们必须在调用时将其DOID 3724 )添加到脚本的末尾它使用Rscript命令:

Rscript wvs_fetch_downloads.r 3724
[1] "http://www.worldvaluessurvey.org/wvsdc/CO00001/F00003724-WVS_Longitudinal_1981-2014_stata_dta_v_2015_04_18.zip"

我创建了一个R函数,只需传递DOID即可获得所需的每个文件位置:

getFileById <- function(fileId)

您可以删除命令行参数解析并直接传递DOID来使用该函数:

#args <- commandArgs(TRUE)
#if(length(args) == 0) {
#   print("No file id specified. Use './script.r ####'.")
#   quit("no")
#}

#fileId <- args[1]
fileId <- "3724"

# DOID=3843 : WVS_EVS_Integrated_Dictionary_Codebook v_2014_09_22 (Excel)
# DOID=3844 : WVS_Values Surveys Integrated Dictionary_TimeSeries_v_2014-04-25 (Excel)
# DOID=3725 : WVS_Longitudinal_1981-2014_rdata_v_2015_04_18
# DOID=3996 : WVS_Longitudinal_1981-2014_sas_v_2015_04_18
# DOID=3723 : WVS_Longitudinal_1981-2014_spss_v_2015_04_18
# DOID=3724 : WVS_Longitudinal_1981-2014_stata_dta_v_2015_04_18

getFileById(fileId)

最终R工作脚本

library(httr)

getFileById <- function(fileId) {
    response <- GET(
        url = "http://www.worldvaluessurvey.org/AJDocumentation.jsp?CndWAVE=-1", 
        add_headers(
            `Accept` = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8", 
            `Accept-Encoding` = "gzip, deflate",
            `Accept-Language` = "en-US,en;q=0.8", 
            `Cache-Control` = "max-age=0",
            `Connection` = "keep-alive", 
            `Host` = "www.worldvaluessurvey.org", 
            `User-Agent` = "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0",
            `Content-type` = "application/x-www-form-urlencoded",
            `Referer` = "http://www.worldvaluessurvey.org/AJDownloadLicense.jsp", 
            `Upgrade-Insecure-Requests` = "1"))

    set_cookie <- headers(response)$`set-cookie`
    cookies <- strsplit(set_cookie, ';')
    cookie <- cookies[[1]][1]

    response <- GET(
        url = "http://www.worldvaluessurvey.org/jdsStatJD.jsp?ID=2.72.48.149%09IT%09undefined%0941.8902%2C12.4923%09Lazio%09Roma%09Orange%20SA%20Telecommunications%20Corporation&url=http%3A%2F%2Fwww.worldvaluessurvey.org%2FAJDocumentation.jsp&referer=null&cms=Documentation", 
        add_headers(
            `Accept` = "*/*", 
            `Accept-Encoding` = "gzip, deflate",
            `Accept-Language` = "en-US,en;q=0.8", 
            `Cache-Control` = "max-age=0",
            `Connection` = "keep-alive", 
            `X-Requested-With` = "XMLHttpRequest",
            `Host` = "www.worldvaluessurvey.org", 
            `User-Agent` = "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0",
            `Content-type` = "application/x-www-form-urlencoded",
            `Referer` = "http://www.worldvaluessurvey.org/AJDocumentation.jsp?CndWAVE=-1",
            `Cookie` = cookie))

    post_data <- list( 
        ulthost = "WVS",
        CMSID = "",
        CndWAVE = "-1",
        SAID = "-1",
        DOID = fileId,
        AJArchive = "WVS Data Archive",
        EdFunction = "",
        DOP = "",
        PUB = "")  

    response <- POST(
        url = "http://www.worldvaluessurvey.org/AJDownload.jsp", 
        config(followlocation = FALSE),
        add_headers(
            `Accept` = "*/*", 
            `Accept-Encoding` = "gzip, deflate",
            `Accept-Language` = "en-US,en;q=0.8", 
            `Cache-Control` = "max-age=0",
            `Connection` = "keep-alive",
            `Host` = "www.worldvaluessurvey.org",
            `User-Agent` = "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0",
            `Content-type` = "application/x-www-form-urlencoded",
            `Referer` = "http://www.worldvaluessurvey.org/AJDocumentation.jsp?CndWAVE=-1",
            `Cookie` = cookie),
        body = post_data,
        encode = "form")

    location <- headers(response)$location
    location
}

args <- commandArgs(TRUE)
if(length(args) == 0) {
    print("No file id specified. Use './script.r ####'.")
    quit("no")
}

fileId <- args[1]

# DOID=3843 : WVS_EVS_Integrated_Dictionary_Codebook v_2014_09_22 (Excel)
# DOID=3844 : WVS_Values Surveys Integrated Dictionary_TimeSeries_v_2014-04-25 (Excel)
# DOID=3725 : WVS_Longitudinal_1981-2014_rdata_v_2015_04_18
# DOID=3996 : WVS_Longitudinal_1981-2014_sas_v_2015_04_18
# DOID=3723 : WVS_Longitudinal_1981-2014_spss_v_2015_04_18
# DOID=3724 : WVS_Longitudinal_1981-2014_stata_dta_v_2015_04_18

getFileById(fileId)

答案 1 :(得分:0)

根据the source of the underlying httr::request_perform,您从VERB()获得的对象如下所示:

res <- response(
  url = resp$url,
  status_code = resp$status_code,
  headers = headers,
  all_headers = all_headers,
  cookies = curl::handle_cookies(handle),
  content = resp$content,
  date = date,
  times = resp$times,
  request = req,
  handle = handle
)

所以,您对其headersall_headersresponse is but a structure)感兴趣。如果涉及重定向,则all_headers将包含由curl::parse_headers()返回的多组标头,headers始终是最终设置。