R httr post-authentication download works in interactive mode but fails in a function

Time: 2018-02-23 17:06:31

Tags: r cookies https web-scraping httr

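The code below works fine when run interactively but fails when used inside a function. It is just two authentication POST requests followed by a data download. My goal is to get it working inside a function, not only in interactive mode.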

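This question is a follow-up to an earlier question. ICPSR recently updated their website. The minimal reproducible example below requires a free account, available at: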

https://www.icpsr.umich.edu/rpxlogin?path=ICPSR&request_uri=https%3a%2f%2fwww.icpsr.umich.edu%2ficpsrweb%2findex.jsp

I tried adding Sys.sleep(1) and various extra httr::GET / httr::POST calls, but nothing worked.

my_download <-
    function( your_email , your_password ){

        values <-
            list(
                agree = "yes",
                path = "ICPSR" ,
                study = "21600" ,
                ds = "" ,
                bundle = "rdata",
                dups = "yes",
                email=your_email,
                password=your_password
            )


        httr::POST("https://www.icpsr.umich.edu/cgi-bin/terms", body = values)
        httr::POST("https://www.icpsr.umich.edu/rpxlogin", body = values)

        tf <- tempfile()
        httr::GET( 
            "https://www.icpsr.umich.edu/cgi-bin/bob/zipcart2" , 
            query = values , 
            httr::write_disk( tf , overwrite = TRUE ) , 
            httr::progress()
        )

    }

# fails 
my_download( "email@address.com" , "some_password" )

# stepping through works
debug( my_download )
my_download( "email@address.com" , "some_password" )

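Edit: when it fails, the function simply downloads this page as if it were not logged in (instead of the dataset), so for some reason the authentication is lost. If you are logged in to ICPSR, use private browsing to see the page: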

https://www.icpsr.umich.edu/cgi-bin/bob/zipcart2?study=21600&ds=1&bundle=rdata&path=ICPSR
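Since the failure mode is a silent download of an HTML page, a small check on the response can make it visible inside the function. The sketch below is only an illustration: `is_html_type()` and `stop_if_login_page()` are hypothetical helper names, and it assumes the real dataset is served with a non-HTML content type, which the question does not confirm.

```r
library(httr)

# Hypothetical helper: TRUE when a content-type string denotes an HTML page.
is_html_type <- function(content_type) {
  grepl("^text/html", content_type, ignore.case = TRUE)
}

# Hypothetical helper: stop early if the response looks like a web page.
# Assumption: the dataset itself is NOT served as text/html.
stop_if_login_page <- function(resp) {
  if (is_html_type(httr::http_type(resp))) {
    stop("Got an HTML page instead of the dataset; authentication was probably lost.")
  }
  invisible(resp)
}
```

Wrapping the final `httr::GET(...)` call in `stop_if_login_page()` would turn the silent failure into an explicit error.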

Thanks!

1 answer:

Answer 0 (score: 1)

This can happen because the state of the httr package (e.g. cookies) is stored in a handle for each URL (see ?handle).
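To make the handle idea concrete, here is a minimal sketch that creates one explicit handle and passes it to every request, rather than relying on httr's automatic handle pool. `my_download_shared()` and `icpsr_handle` are hypothetical names, and returning the temp file path is a choice made here for illustration. This only makes the cookie sharing explicit; it is not confirmed to fix the problem described in the question.

```r
library(httr)

# Explicit handle for the ICPSR host. Cookies set by responses made
# through this handle stay on it for later requests.
icpsr_handle <- httr::handle("https://www.icpsr.umich.edu")

# Hypothetical variant of my_download() that routes all requests
# through the same explicit handle.
my_download_shared <- function(your_email, your_password, h = icpsr_handle) {
  values <- list(
    agree = "yes", path = "ICPSR", study = "21600", ds = "",
    bundle = "rdata", dups = "yes",
    email = your_email, password = your_password
  )
  base <- "https://www.icpsr.umich.edu"

  # Authenticate, then accept the terms, on the shared handle.
  httr::POST(paste0(base, "/rpxlogin"), body = values, handle = h)
  httr::POST(paste0(base, "/cgi-bin/terms"), body = values, handle = h)

  # Download to a temp file using the same handle (and its cookies).
  tf <- tempfile()
  httr::GET(
    paste0(base, "/cgi-bin/bob/zipcart2"),
    query = values,
    httr::write_disk(tf, overwrite = TRUE),
    handle = h
  )
  tf  # path of the downloaded file
}
```

It would be called like the original, e.g. `my_download_shared("email@address.com", "some_password")`.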

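In this particular case it is still unclear exactly what makes it work, but one strategy is to add a GET request to https://www.icpsr.umich.edu/cgi-bin/bob/ before authenticating and requesting the data. For example,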

my_download <-
    function( your_email , your_password ){
        ## for some reason this is required ...
        httr::GET("https://www.icpsr.umich.edu/cgi-bin/bob/")
        values <-
            list(
                agree = "yes",
                path = "ICPSR" ,
                study = "21600" ,
                ds = "" ,
                bundle = "rdata",
                dups = "yes",
                email=your_email,
                password=your_password
            )
        httr::POST("https://www.icpsr.umich.edu/rpxlogin", body = values)
        httr::POST("https://www.icpsr.umich.edu/cgi-bin/terms", body = values)
        tf <- tempfile()
        httr::GET( 
            "https://www.icpsr.umich.edu/cgi-bin/bob/zipcart2" , 
            query = values , 
            httr::write_disk( tf , overwrite = TRUE ) , 
            httr::progress()
        )
    }

This seems to work, but it is still unclear what exactly the GET request to https://www.icpsr.umich.edu/cgi-bin/bob/ does or why it is needed.
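One way to investigate what the extra GET contributes is to look at the cookies each response carries. The sketch below is a debugging aid only, not an explanation of the behaviour: `cookie_names()` is a hypothetical helper, and the network calls are left commented out because they need a live ICPSR session.

```r
library(httr)

# Hypothetical helper: names of the cookies attached to a response.
cookie_names <- function(resp) {
  httr::cookies(resp)$name
}

# Not run: requires network access.
# Start from a clean slate so earlier interactive requests do not interfere.
# httr::handle_reset("https://www.icpsr.umich.edu")
# warmup <- httr::GET("https://www.icpsr.umich.edu/cgi-bin/bob/")
# cookie_names(warmup)
```

Comparing the cookie names with and without the warm-up request may show whether it sets a session cookie that the later requests depend on.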