R - RCurl: getting data from a password-protected site

Date: 2014-06-05 00:19:36

Tags: r scrape

I am trying to use R to scrape some table data from a password-protected website (I have a valid username/password), but I haven't been successful yet.

As an example, here is the login page for my dentist's site: http://www.deltadentalins.com/uc/index.html

I have tried the following:

library(httr)

download <- "https://www.deltadentalins.com/indService/faces/Home.jspx?_afrLoop=73359272573000&_afrWindowMode=0&_adf.ctrl-state=12pikd0f19_4"
terms <- "http://www.deltadentalins.com/uc/index.html"
values <- list(username = "username", password = "password", TARGET = "",
               SMAUTHREASON = "", POSTPRESERVATIONDATA = "",
               bundle = "all", dups = "yes")

# POST the credentials to the login page, then GET the protected page
POST(terms, body = values)
GET(download, query = values)
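One thing to note with httr is that cookies set by a login POST are kept on the handle for that host, so the login request and the follow-up request need to share a session. A minimal sketch of that pattern - the login path `/siteminderagent/forms/login.fcc` is the one used in the second attempt below, and the form field names are assumptions read off the login page's HTML:

```r
library(httr)

# Hypothetical sketch: send the credentials to the form's real action URL,
# then reuse the same handle so the session cookies persist.
h <- handle("https://www.deltadentalins.com")
login <- POST(handle = h, path = "/siteminderagent/forms/login.fcc",
              body = list(username = "username", password = "password"),
              encode = "form")
stop_for_status(login)

# Cookies stored on the handle are sent automatically, so this GET
# should be authenticated if the login above succeeded.
page <- GET(handle = h, path = "/indService/faces/Home.jspx")
content(page, as = "text")
```

This is only a sketch under the assumption that the form posts plain `username`/`password` fields; the actual field names must be confirmed against the page source.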

I have also tried:

your.username <- 'username'
your.password <- 'password'

require(SAScii)
require(RCurl)
require(XML)

agent <- "Firefox/23.0"
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
curl <- getCurlHandle()
curlSetOpt(
  cookiejar = 'cookies.txt',
  useragent = agent,
  followlocation = TRUE,
  autoreferer = TRUE,
  curl = curl
)

# list parameters to pass to the website (pulled from the source html)
params <-
list(
'lt' = "",
'_eventID' = "",
'TARGET' = "",
'SMAUTHREASON' = "",
'POSTPRESERVATIONDATA' = "",
'SMAGENTNAME' = agent,
'username' = your.username,
'password' = your.password 
    )

# log into the form
html <- postForm('https://www.deltadentalins.com/siteminderagent/forms/login.fcc', .params = params, curl = curl)

# inspect the response
html
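Even if the `postForm()` call succeeds, the protected page still has to be requested with the same curl handle so that the cookies saved in the cookie jar are sent along. A sketch of that missing step, assuming the login above worked (the target URL is the one from the question):

```r
# Fetch the protected page with the same handle (and thus the same cookies)
dataPage <- getURL(
  "https://www.deltadentalins.com/indService/faces/Home.jspx",
  curl = curl
)

# Parse any HTML tables in the response
tables <- readHTMLTable(dataPage, header = TRUE)
```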

I can't get this to work. Are there any experts out there who can help?

1 answer:

Answer 0 (score: 1)

Updated March 5, 2016 to work with the package RSelenium

#### FRONT MATTER ####

library(devtools)
library(RSelenium)
library(XML)
library(plyr)

######################

## This block will open the Firefox browser, which is linked to R
RSelenium::checkForServer()
RSelenium::startServer()
remDr <- remoteDriver()
remDr$open()
url <- "yoururl"
remDr$navigate(url)

The first section loads the needed packages, sets the login URL, and then opens it in a Firefox instance. I type in my username & password, and then I'm in and can start scraping.
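Because the login here is typed by hand in the browser window, the script has to wait before scraping. One simple way to do that (my own addition, not part of the original answer) is to block on the console and then confirm the browser has moved past the login page:

```r
# Block until the manual login in Firefox is finished
readline(prompt = "Log in through the Firefox window, then press Enter: ")

# Confirm we are past the login page before scraping
remDr$getCurrentUrl()
```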

infoTable <- readHTMLTable(remDr$getPageSource()[[1]], header = TRUE)
infoTable
Table1 <- infoTable[[1]]
Apps <- Table1[, 1]  # Application Numbers

For this example, the first page contains two tables. The first one is the one I'm interested in: it has a table of application numbers and names. I pull out the first column (the application numbers).
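`readHTMLTable()` returns a list with one data frame per `<table>` element on the page, so before indexing into the result it can help to check what was actually found - a small sketch using the objects above:

```r
# How many tables did the page contain?
length(infoTable)

# Column headers of each table, to spot the one of interest
lapply(infoTable, colnames)
```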

Links2 <- paste("https://yourURL?ApplicantID=", Apps, sep = "")

The data I want are stored in the individual applications, so this bit creates the links I want to loop over.

### Grabs contact info table from each page

LL <- lapply(seq_along(Links2),
  function(i) {
    url <- Links2[i]
    remDr$navigate(url)
    infoTable <- readHTMLTable(remDr$getPageSource()[[1]], header = TRUE)

    # The contact info may be in the 2nd or 3rd table, so check first
    if ("First Name" %in% colnames(infoTable[[2]])) {
      infoTable2 <- cbind(infoTable[[1]][1, ], infoTable[[2]][1, ])
    } else {
      infoTable2 <- cbind(infoTable[[1]][1, ], infoTable[[3]][1, ])
    }

    print(infoTable2)
  }
)

results <- do.call(rbind.fill, LL)
results
write.csv(results, "C:/pathway/results2.csv")

This last section loops through the link for each application and grabs the table with its contact information (either table 2 or table 3, so R has to check first). Thanks again to Chinmay Patil for the tip on relenium!
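The final `rbind.fill()` call works because plyr stacks data frames even when their columns differ, padding missing columns with NA - which matters here since the contact table may be table 2 on some pages and table 3 on others. A tiny self-contained illustration:

```r
library(plyr)

# Two rows with partially different columns
a <- data.frame(id = 1, FirstName = "Ann")
b <- data.frame(id = 2, Phone = "555-0100")

# rbind.fill() pads the missing cells with NA instead of erroring
rbind.fill(a, b)
```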