Question

I am trying to scrape a website that has a form where you fill in dates and then submit it. A CSV is then downloaded.

rm(list = ls())

library(rvest)


url <- "http://itc.aeso.ca/itc/public/queryHistoricalIntertieReport.do"

pgsession<-html_session(url)


pgform <- html_form(pgsession)[[1]]
filled_form<- 
  set_values(
    pgform, 
    availableEffectiveDate="943279200000 1999-11-22 07:00:00 MST (1999-11-22 14:00:00 GMT)", 
    availableExpiryDate="1561960800000 2019-07-01 00:00:00 MDT (2019-07-01 06:00:00 GMT)",
    fileFormat="CSV",
    startDate="2018-05-01", 
    endDate="2018-05-02"
  )


html_nodes(pgsession, "table") %>%
html_table(fill=TRUE)

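The form does not seem to get submitted. All that comes back is a jumbled mess containing a "start date/stop date required" message.

Any help is greatly appreciated.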

Answer 1

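I was able to figure it out. I don't know exactly what is going on, but my thought process was:

1) It seemed I needed to get a jsessionid.
2) The form needed to be submitted as a POST request, not a GET request.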
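To make those two points concrete, here is a minimal offline sketch of how the request target and form body can be put together. The helper name build_request and the session id "ABC123" are made up for illustration; a real session id comes from the server's cookie, and the real form also carries the two hidden fields, which are left out here for brevity.

```r
# Hypothetical helper: build the POST target and form body for a single day.
build_request <- function(base_url, session_id, day) {
  day <- format(as.Date(day), "%Y-%m-%d")
  list(
    # the session id is appended to the path after ";jsessionid="
    url  = paste0(base_url, "queryHistoricalIntertieReport.do;jsessionid=", session_id),
    # these fields go in the POST body, not in the query string
    body = list(startDate = day, endDate = day, fileFormat = "CSV")
  )
}

# "ABC123" is a placeholder, not a real session id
req <- build_request("http://itc.aeso.ca/itc/public/", "ABC123", "2018-05-01")
print(req$url)
```

Calling the helper once per date would give one request per day, which is presumably what an index into a vector of dates is for.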

library(curl)
library(xml2)
library(httr)
library(rvest)
library(stringi)
library(dplyr)
library(stringr)
library(lubridate)

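# base URL ends with a slash, so the path appended below must not start with one
url <- "http://itc.aeso.ca/itc/public/"

# seq_dates and D were not defined in the original snippet; assumed here to be
# the dates to query (taken from the question) and an index into them
seq_dates <- as.character(seq(as.Date("2018-05-01"), as.Date("2018-05-02"), by = "day"))
D <- 1

# warm up the curl handle
start <- GET(url)

# get the cookies (assumes the first cookie is the JSESSIONID)
ck <- handle_cookies(handle_find(url)$handle)

# make the POST request
res <- POST(paste0(url, "queryHistoricalIntertieReport.do;jsessionid=", ck[1, ]$value),
        user_agent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:40.0) Gecko/20100101 Firefox/40.0"),
        accept("*/*"),
        encode="form",
        multipart=FALSE, # this gens a warning but seems to be necessary
        add_headers(Referer=url),
        # the two available* entries are the hidden fields, with the values shown in the question
        body=list(`startDate`=seq_dates[D],
                  `endDate`=seq_dates[D],
                  `fileFormat`="CSV",
                  `availableEffectiveDate`="943279200000 1999-11-22 07:00:00 MST (1999-11-22 14:00:00 GMT)",
                  `availableExpiryDate`="1561960800000 2019-07-01 00:00:00 MDT (2019-07-01 06:00:00 GMT)"))

# the response body is raw bytes; convert it to text and read it as CSV
tmp <- textConnection(rawToChar(res$content))

datIn <- read.csv(tmp, stringsAsFactors=FALSE, header=FALSE)
close(tmp)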
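The last step, turning the raw response bytes into a data frame, can be tried without touching the network. The sketch below simulates a response body with charToRaw; the column names and values are invented for illustration and are not what the AESO report actually returns.

```r
# Simulated response body; with httr the real bytes would be in res$content.
# Column names and values here are made up.
raw_body <- charToRaw("Date,Hour,Value\n2018-05-01,1,100\n2018-05-01,2,150\n")

# rawToChar() turns the bytes into one string; read.csv(text = ...) parses it
dat <- read.csv(text = rawToChar(raw_body), stringsAsFactors = FALSE)
str(dat)
```

Using read.csv(text = ...) avoids having to open and close a textConnection by hand.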

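The body/list part holds the fields I filled in on the form, including the 2 hidden fields. I found these with the F12 developer tools in Chrome, clicking around until I found the form field IDs.

The rest I took from here: R web scraper with jsessionid

I don't know what the user_agent string is for, but I'm on a PC and the Mac string still works.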
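As an alternative to clicking through the browser's developer tools, the form fields can probably also be listed from R. This is only a sketch using xml2 on a made-up HTML fragment; the real page's markup may differ, and the shortened hidden-field value is a placeholder.

```r
library(xml2)

# Made-up form fragment for illustration; not copied from the real page
page <- read_html('<form action="queryHistoricalIntertieReport.do" method="post">
  <input type="hidden" name="availableEffectiveDate" value="943279200000">
  <input type="text" name="startDate" value="">
  <input type="text" name="endDate" value="">
</form>')

# collect every <input> inside a <form> with its name, type and value
inputs <- xml_find_all(page, "//form//input")
fields <- data.frame(
  name  = xml_attr(inputs, "name"),
  type  = xml_attr(inputs, "type"),
  value = xml_attr(inputs, "value"),
  stringsAsFactors = FALSE
)

# hidden inputs are the ones a browser sends without showing them
hidden <- fields[fields$type == "hidden", ]
print(hidden)
```

Against the live site, read_html() would be given the page URL instead of an inline string.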

R - Submitting a web form using rvest
