Debugging RCurl-based authentication & form submission

Time: 2014-03-11 04:27:45

Tags: r forms authentication web-scraping rcurl

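The SourceForge Research Data Archive (SRDA) is one of the data sources for my dissertation research. I am having difficulty debugging the following issue related to SRDA data collection.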

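Data collection from SRDA requires authentication, followed by submitting a web form with an SQL query. After a query is processed successfully, the system generates a text file with the query results. While testing my R code for SRDA data collection, I changed the SQL request to make sure the results file would be regenerated. However, I found that the file's contents stay the same (they correspond to the previous query). I think the contents are not refreshed because either the authentication or the query form submission fails. Here is the debug output of the code (https://github.com/abnova/diss-floss/blob/master/import/getSourceForgeData.R):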

make importSourceForge

Rscript --no-save --no-restore --verbose getSourceForgeData.R
running
  '/usr/lib/R/bin/R --slave --no-restore --no-save --no-restore --file=getSourceForgeData.R'

Loading required package: RCurl
Loading required package: methods
Loading required package: bitops
Loading required package: digest

Retrieving SourceForge data...

Checking request "SELECT *
FROM sf1104.users a, sf1104.artifact b
WHERE a.user_id = b.submitted_by AND b.artifact_id = 304727"...
* About to connect() to zerlot.cse.nd.edu port 80 (#0)
*   Trying 129.74.152.47... * connected
> POST /mediawiki/index.php?title=Special:Userlogin&action=submitlogin&type=login HTTP/1.1
Host: zerlot.cse.nd.edu
Accept: */*
Content-Length: 37
Content-Type: application/x-www-form-urlencoded

* upload completely sent off: 37out of 37 bytes
< HTTP/1.1 200 OK
< Date: Tue, 11 Mar 2014 03:49:04 GMT
< Server: Apache/2.2.8 (Ubuntu) PHP/5.2.4-2ubuntu5.25 with Suhosin-Patch
< X-Powered-By: PHP/5.2.4-2ubuntu5.25
* Added cookie wiki_db_session="c61...a3c" for domain zerlot.cse.nd.edu, path /, expire 0
< Set-Cookie: wiki_db_session=c61...a3c; path=/
< Content-language: en
< Vary: Accept-Encoding,Cookie
< Expires: Thu, 01 Jan 1970 00:00:00 GMT
< Cache-Control: private, must-revalidate, max-age=0
< Transfer-Encoding: chunked
< Content-Type: text/html; charset=UTF-8
<
* Connection #0 to host zerlot.cse.nd.edu left intact
[1] "Before second postForm()"
* Re-using existing connection! (#0) with host zerlot.cse.nd.edu
* Connected to zerlot.cse.nd.edu (129.74.152.47) port 80 (#0)
> POST /cgi-bin/form.pl HTTP/1.1
Host: zerlot.cse.nd.edu
Accept: */*
Cookie: wiki_db_session=c61...a3c
Content-Length: 129
Content-Type: application/x-www-form-urlencoded

* upload completely sent off: 129out of 129 bytes
< HTTP/1.1 500 Internal Server Error
< Date: Tue, 11 Mar 2014 03:49:04 GMT
< Server: Apache/2.2.8 (Ubuntu) PHP/5.2.4-2ubuntu5.25 with Suhosin-Patch
< Vary: Accept-Encoding
< Connection: close
< Transfer-Encoding: chunked
< Content-Type: text/html
<
* Closing connection #0
Error: Internal Server Error
Execution halted
make: *** [importSourceForge] Error 1

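I have tried to solve this using the debug output as well as the network protocol analyzer in Firefox's built-in developer tools, but without much success so far. Any advice and help would be much appreciated.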
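One check that may help when reading verbose output like the above is to rebuild the form-encoded request body locally and compare its size with the Content-Length that curl reports. The following is a minimal sketch in base R, not RCurl's own encoder: `form_body` is a hypothetical helper, and libcurl's exact escaping may differ from `utils::URLencode`, so the byte count should be treated as an approximation.

```r
# Hypothetical helper (not part of RCurl): build an
# application/x-www-form-urlencoded body from a named list.
form_body <- function(params) {
  enc <- function(x) utils::URLencode(as.character(x), reserved = TRUE)
  keys   <- vapply(names(params), enc, character(1))
  values <- vapply(params, enc, character(1))
  paste(paste0(keys, "=", values), collapse = "&")
}

# Field names and values taken from the query form parameters in this question
query_params <- list(uitems       = "*",
                     utables      = "sf1104.users a, sf1104.artifact b",
                     uwhere       = "b.artifact_id = 304727",
                     useparator   = ":",
                     append_query = "1")

encoded <- form_body(query_params)
cat(encoded, "\n")
cat("Approximate body size in bytes:", nchar(encoded, type = "bytes"), "\n")
```

If the size differs a lot from the Content-Length in the verbose log, the request being sent is probably not the one the code was meant to send.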

Update:

if (!require(RCurl)) install.packages('RCurl')
if (!require(digest)) install.packages('digest')

library(RCurl)
library(digest)

# Users must authenticate to access Query Form
SRDA_HOST_URL  <- "http://zerlot.cse.nd.edu"
SRDA_LOGIN_URL <- "/mediawiki/index.php?title=Special:Userlogin"
SRDA_LOGIN_REQ <- "&action=submitlogin&type=login"

# SRDA URL that Query Form sends POST requests to
SRDA_QUERY_URL <- "/cgi-bin/form.pl"

# SRDA URL of the text file that holds the query results
SRDA_QRESULT_URL <- "/qresult/blekh/blekh.txt"

# Parameters for result's format
DATA_SEP <- ":" # data separator
ADD_SQL  <- "1" # add SQL to file

curl <- getCurlHandle()

srdaLogin <- function (loginURL, username, password) {

  curlSetOpt(curl = curl, cookiejar = 'cookies.txt',
             ssl.verifyhost = FALSE, ssl.verifypeer = FALSE,
             followlocation = TRUE, verbose = TRUE)

  params <- list('wpName1' = username, 'wpPassword1' = password)

  if(url.exists(loginURL)) {
    reply <- postForm(loginURL, .params = params, curl = curl,
                      style = "POST")
    #if (DEBUG) print(reply)
    info <- getCurlInfo(curl)
    return (ifelse(info$response.code == 200, TRUE, FALSE))
  }
  else {
    stop("Can't access login URL!")
  }
}


srdaConvertRequest <- function (request) {

  return (list(select = "*",
               from = "sf1104.users a, sf1104.artifact b",
               where = "b.artifact_id = 304727"))
}


srdaRequestData <- function (requestURL, select, from, where, sep, sql) {

  params <- list('uitems' = select,
                 'utables' = from,
                 'uwhere' = where,
                 'useparator' = sep,
                 'append_query' = sql)

  if(url.exists(requestURL)) {
    reply <- postForm(requestURL, .params = params, #.opts = opts,
                      curl = curl, style = "POST")
  }
}


srdaGetData <- function(request) {

  resultsURL <- paste(SRDA_HOST_URL, SRDA_QRESULT_URL,
                      collapse="", sep="")

  results.query <- readLines(resultsURL, n = 1)

  return (ifelse(results.query == request, TRUE, FALSE))
}


getSourceForgeData <- function (request) {

  # Construct SRDA login and query URLs
  loginURL <- paste(SRDA_HOST_URL, SRDA_LOGIN_URL, SRDA_LOGIN_REQ,
                    collapse="", sep="")
  queryURL <- paste(SRDA_HOST_URL, SRDA_QUERY_URL, collapse="", sep="")

  # Log into the system 
  if (!srdaLogin(loginURL, USER, PASS))
    stop("Login failed!")

  rq <- srdaConvertRequest(request)

  srdaRequestData(queryURL,
                  rq$select, rq$from, rq$where, DATA_SEP, ADD_SQL)

  if (!srdaGetData(request))
    stop("Data collection failed!")
}


message("\nTesting SourceForge data collection...\n")

getSourceForgeData("SELECT * 
FROM sf1104.users a, sf1104.artifact b 
WHERE a.user_id = b.submitted_by AND b.artifact_id = 304727")

# clean up
close(curl)

Update 2 (version without functions):

if (!require(RCurl)) install.packages('RCurl')
library(RCurl)

# Users must authenticate to access Query Form
SRDA_HOST_URL  <- "http://zerlot.cse.nd.edu"
SRDA_LOGIN_URL <- "/mediawiki/index.php?title=Special:Userlogin"
SRDA_LOGIN_REQ <- "&action=submitlogin&type=login"

# SRDA URL that Query Form sends POST requests to
SRDA_QUERY_URL <- "/cgi-bin/form.pl"

# SRDA URL of the text file that holds the query results
SRDA_QRESULT_URL <- "/qresult/blekh/blekh.txt"

# Parameters for result's format
DATA_SEP <- ":" # data separator
ADD_SQL  <- "1" # add SQL to file


message("\nTesting SourceForge data collection...\n")

curl <- getCurlHandle()

curlSetOpt(curl = curl, cookiejar = 'cookies.txt',
           ssl.verifyhost = FALSE, ssl.verifypeer = FALSE,
           followlocation = TRUE, verbose = TRUE)

# === Authentication ===

loginParams <- list('wpName1' = USER, 'wpPassword1' = PASS)

loginURL <- paste(SRDA_HOST_URL, SRDA_LOGIN_URL, SRDA_LOGIN_REQ,
                  collapse="", sep="")

if (url.exists(loginURL)) {
  postForm(loginURL, .params = loginParams, curl = curl, style = "POST")
  info <- getCurlInfo(curl)
  message("\nLogin results - HTTP status code: ", info$response.code, "\n\n")
} else {
  stop("\nCan't access login URL!\n\n")
}

# === Data collection ===

# Previous query was: "SELECT * FROM sf0305.users WHERE user_id < 100"
query <- list(select = "*",
              from = "sf1104.users a, sf1104.artifact b",
              where = "b.artifact_id = 304727") 

getDataParams <- list('uitems'       = query$select,
                      'utables'      = query$from,
                      'uwhere'       = query$where,
                      'useparator'   = DATA_SEP,
                      'append_query' = ADD_SQL)

queryURL <- paste(SRDA_HOST_URL, SRDA_QUERY_URL, collapse="", sep="")

if(url.exists(queryURL)) {
  postForm(queryURL, .params = getDataParams, curl = curl, style = "POST")
  resultsURL <- paste(SRDA_HOST_URL, SRDA_QRESULT_URL,
                      collapse="", sep="")
  results.query <- readLines(resultsURL, n = 1)
  request <- paste(query$select, query$from, query$where)
  if (results.query == request)
    message("\nData request is successful, SQL query: ", request, "\n\n")
  else
    message("\nData request failed, SQL query: ", request, "\n\n")
} else {
  stop("\nCan't access data query URL!\n\n")
}

close(curl)

Update 3 (server-side debugging):

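Finally, I was able to get in touch with the person responsible for the system, who helped me narrow the problem down to, IMHO, cookie management. Here is the error log record that corresponds to running my code: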

  

[Fri Mar 21 15:33:14 2014] [error] [client 54.204.180.203] [Fri Mar 21 15:33:14 2014] form.pl: /tmp/sess_3e55593e436a013597cd320e4c6a2fac: at /var/www/cgi-bin/form.pl line 43

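Here is a snippet of the server-side script (Perl) that generated that error (line 1 of the script is the interpreter directive, so the reported line 43 is most likely line number 44):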

42     if (-e "/tmp/sess_$file") {
43     $session = PHP::Session->new($cgi->cookie("$session_name"));
44     $user_id = $session->get('wsUserID');
45     $user_name = $session->get('wsUserName');

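Here is the session information (1) after authentication and (2) after submitting the data request, obtained by tracing a manual authentication and a manual submission of the data request form: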

  

(1) "wiki_dbUserID=449; expires=Sun, 20-Apr-2014 21:04:14 GMT; path=/ wiki_dbUserName=Blekh; expires=Sun, 20-Apr-2014 21:04:14 GMT; path=/ wiki_dbToken=deleted; expires=Thu, 21-Mar-2013 21:04:13 GMT"

     

(2) wiki_db_session=aaed058f97059174a59effe44b137cbc; _ga=GA1.2.2065853334.1395410153; EDSSID=e24ff5ed891c28c61f2d1f8dec424274; wiki_dbUserName=Blekh; wiki_dbLoggedOut=20140321210314; wiki_dbUserID=449
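To compare what the browser sends with what the Perl script looks for, the cookie string in (2) can be split into name/value pairs. This is a rough base R sketch: `parse_cookie_header` is a hypothetical helper, and it assumes that the `$file` variable in the Perl snippet holds the value of the wiki_db_session cookie, which the snippet itself does not confirm.

```r
# Hypothetical helper: split a "Cookie:" header value into a named vector
parse_cookie_header <- function(header) {
  parts <- trimws(strsplit(header, ";", fixed = TRUE)[[1]])
  parts <- parts[nzchar(parts)]
  eq    <- as.integer(regexpr("=", parts, fixed = TRUE))
  keep  <- eq > 0                       # drop fragments without "="
  values <- substring(parts[keep], eq[keep] + 1)
  names(values) <- substring(parts[keep], 1, eq[keep] - 1)
  values
}

# Cookie string (2), captured after the manual form submission
cookies <- parse_cookie_header(paste0(
  "wiki_db_session=aaed058f97059174a59effe44b137cbc; ",
  "_ga=GA1.2.2065853334.1395410153; ",
  "EDSSID=e24ff5ed891c28c61f2d1f8dec424274; ",
  "wiki_dbUserName=Blekh; wiki_dbLoggedOut=20140321210314; ",
  "wiki_dbUserID=449"))

# Assumption: the server checks for /tmp/sess_<session cookie value>
session_file <- file.path("/tmp", paste0("sess_", cookies[["wiki_db_session"]]))
print(names(cookies))
print(session_file)
```

Running the same helper on the cookies that the R script actually sends would show which of these names are missing from the automated request.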

Thanks in advance for any help in fixing the problem with my code!

3 Answers:

Answer 0: (score: 1)

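Finally, finally, finally! I have figured out what was causing this problem, which gave me so many headaches (both figuratively and literally). It forced me to spend a lot of time reading various Internet sources (including many SO questions and answers), debugging my code, and communicating with people. It took a lot of time, but not in vain, as I learned a lot about RCurl, cookies, web forms, and the HTTP protocol.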

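The cause turned out to be much simpler than I had thought. While the direct cause of the form submission failure was related to cookie management, the underlying cause was using the wrong parameter names (IDs) for the authentication form fields. The two pairs were very similar, and a single extra character was enough to trigger the whole problem.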

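Lesson learned: when you run into a problem, especially one that involves authentication, it is very important to check all names and IDs several times and make sure they correspond to the ones that are supposed to be used. Thanks to everyone who helped, or tried to help, me with this issue!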
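A check of this kind can be partly automated by pulling the `name` attributes out of the login form's HTML and comparing them with the parameter names the script posts. The sketch below uses only base R regular expressions. The markup in `login_html` is hypothetical (it is not copied from SRDA) and merely illustrates a form whose `name` and `id` attributes differ by one character; `extract_input_names` is likewise a made-up helper, and a real page is better parsed with an HTML parser.

```r
# Hypothetical helper: collect the name attribute of every <input> tag
extract_input_names <- function(html) {
  tags <- regmatches(html, gregexpr("<input[^>]*>", html, ignore.case = TRUE))[[1]]
  pattern <- "\\bname[[:space:]]*=[[:space:]]*[\"']([^\"']*)[\"']"
  hits <- regmatches(tags, regexpr(pattern, tags, ignore.case = TRUE, perl = TRUE))
  sub(pattern, "\\1", hits, ignore.case = TRUE, perl = TRUE)
}

# Hypothetical login form: note that name and id are not the same
login_html <- paste0(
  '<form method="post">',
  '<input type="text" name="wpName" id="wpName1" />',
  '<input type="password" name="wpPassword" id="wpPassword1" />',
  '<input type="submit" name="wpLoginattempt" value="Log in" />',
  '</form>')

# Parameter names used in the question's login request
posted <- c("wpName1", "wpPassword1")

form_fields <- extract_input_names(login_html)
unknown <- setdiff(posted, form_fields)
if (length(unknown) > 0)
  message("Posted names not found in the form: ",
          paste(unknown, collapse = ", "))
```

Any name reported as not found is worth comparing by hand against the form's source before looking at cookies or headers.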

Answer 1: (score: 0)

I have simplified the code even further:

library(httr)

base_url  <- "http://srda.cse.nd.edu"

loginURL <- modify_url(
  base_url, 
  path = "mediawiki/index.php", 
  query = list(
    title = "Special:Userlogin", 
    action = "submitlogin",
    type = "login",
    wpName1 = USER,
    wpPassword1 = PASS
  )
)
r <- POST(loginURL)
stop_for_status(r)

queryURL <- modify_url(base_url, path = "cgi-bin/form.pl")
query <- list(
  uitems       = "user_name",
  utables      = "sf1104.users a, sf1104.artifact b",
  uwhere       = "a.user_id = b.submitted_by AND b.artifact_id = 304727",
  useparator   = ":",
  append_query = "1"
)
r <- POST(queryURL, body = query, encode = "form")
stop_for_status(r)

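But I still get a 500. I have tried:

  • Setting the extra cookies that I see in the browser (wiki_dbUserID, wiki_dbUserName)
  • Setting the DNT header to 1
  • Setting the referer to http://srda.cse.nd.edu/cgi-bin/form.pl
  • Setting the user agent to be the same as Chrome's
  • Setting Accept to "text/html"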

Answer 2: (score: 0)

The following provides an explanation of the scenario (the error condition).

From RFC 2616, the HTTP/1.1 specification (as hosted by the W3C):

  

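10.5 Server Error 5xx

Response status codes beginning with the digit "5" indicate cases in which the server is aware that it has erred or is incapable of performing the request. Except when responding to a HEAD request, the server SHOULD include an entity containing an explanation of the error situation, and whether it is a temporary or permanent condition. User agents SHOULD display any included entity to the user. These response codes are applicable to any request method.

10.5.1 500 Internal Server Error

The server encountered an unexpected condition which prevented it from fulfilling the request.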

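My interpretation of paragraph 10.5 is that it implies the server should provide a more detailed explanation of the error situation than the one given in paragraph 10.5.1. However, I recognize that the message for status code 500 (paragraph 10.5.1) may well be considered sufficient. Confirmation of either interpretation is welcome!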