Error when logging in to a website for scraping with rvest

Time: 2016-10-22 23:49:26

Tags: r session web-scraping http-status-code-403 rvest

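I'm trying to scrape a page on a website that requires a login, and I keep getting a 403 error.

I adapted the code from these two posts on this site: "Using rvest or httr to log in to non-standard forms on a webpage" and "how to reuse a session to avoid repeated login when scraping with rvest?"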

library(rvest)
pgsession <- html_session("https://www.optionslam.com/earnings/stocks/MSFT?page=-1")
pgform <- html_form(pgsession)[[1]]
filled_form <- set_values(pgform, 'username'='user', 'password'='pass')
s <- submit_form(pgsession, filled_form) # s is your logged in session

When I run the code, I get this message:

Submitting with 'NULL'
Warning message:
In request_POST(session, url = url, body = request$values, encode = request$encode,  :
  Forbidden (HTTP 403).

I also ran the code this way, setting the user_agent as R.S. suggested in the comments, but I get the same error as above.

library(rvest)
library(httr)
uastring <- "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36"
pgsession <- html_session("https://www.optionslam.com/earnings/stocks/MSFT?page=-1", user_agent(uastring))
pgform <- html_form(pgsession)[[1]]
filled_form <- set_values(pgform, 'username'='user', 'password'='pass')
s <- submit_form(pgsession, filled_form) # s is your logged in session

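If you pull up the page without logging in, it shows part of the data table at the bottom right, under the text "Earnings events available: 65".

Once logged in, it shows all 65 events, and the table is populated with the data I want to download. I have all the other code I need ready; I'm just stuck on the login step.

Thanks for your help.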

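2 answers:

Answer 0 (score: 4)

Following R.S.'s suggestion, I logged in successfully using RSelenium.

A quick note for Mac users using chrome or phantom: I'm running El Capitan and had some trouble getting the Mac to recognize the path to the two bin files. Instead, I moved the bin files to /usr/local/bin, and they ran without a problem.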

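Here is the code that does this:

library(RSelenium)
# Start the Selenium server (startServer() is the RSelenium API of the time;
# later releases replaced it with rsDriver())
RSelenium::startServer()
remDr <- remoteDriver(browserName = "chrome")
remDr$open()

# Open the login page, fill in the credentials, and press Enter to submit
appURL <- 'https://www.optionslam.com/accounts/login/'
remDr$navigate(appURL)
remDr$findElement("id", "id_username")$sendKeysToElement(list("user"))
remDr$findElement("id", "id_password")$sendKeysToElement(list("password", key='enter'))

# Now logged in, go to the page to scrape
appURL <- 'https://www.optionslam.com/earnings/stocks/MSFT?page=-1'
remDr$navigate(appURL)

This can also be done with phantom.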
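
Once the browser is logged in and sitting on the target page, the rendered HTML can be handed to rvest for parsing. The following is only a sketch: extract_tables is a hypothetical helper name, and it assumes the data of interest lives in ordinary <table> elements and that remDr is still open.

```r
library(rvest)

# Hypothetical helper: parse every <table> found in a page-source string
# and return them as a list of data frames.
extract_tables <- function(page_source) {
  page <- read_html(page_source)
  html_table(html_nodes(page, "table"))
}

# With the logged-in browser from above (assumes remDr is still open):
# tables <- extract_tables(remDr$getPageSource()[[1]])
```

getPageSource() returns a list, so [[1]] pulls out the HTML string. On older rvest versions, ragged tables may need fill = TRUE passed to html_table().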

Answer 1 (score: 1)

Here is an answer that solves the problem in the original use case with rvest:

   library(rvest)
   library(httr)
   uastring <- "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36"

   pgsession <- html_session("https://www.optionslam.com/earnings/stocks/MSFT?page=-1", user_agent(uastring))

   pgform <- html_form(pgsession)[[1]]

   filled_form <- set_values(pgform,
                             username = 'un',
                             password = 'ps')

   s <- submit_form(pgsession, filled_form, submit = NULL, config(referer = pgsession$url)) # s is your logged in session

The site requires the request to say which page you came from (the referer [sic]), which is what this argument supplies:

config(referer = pgsession$url)
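
To see what that argument amounts to, the httr request objects can be inspected without contacting the site. This is a rough sketch: it assumes httr's request objects expose options and headers fields (an internal detail that may change), and the variable names are only illustrative.

```r
library(httr)

login_url <- "https://www.optionslam.com/earnings/stocks/MSFT?page=-1"

# config(referer = ...) sets libcurl's "referer" option on the request,
# which makes curl send a Referer header.
via_config <- config(referer = login_url)

# add_headers() sets the same HTTP header explicitly.
via_header <- add_headers(Referer = login_url)

# Inspect what each object carries (no network access involved).
str(via_config$options)
str(via_header$headers)
```

Since submit_form() passes its extra arguments on to httr, either form should be usable there, though only the config(referer = ...) version is the one shown in this answer.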