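I am trying to scrape a page on a website that requires a login, and I keep getting a 403 error.
I adapted my code from these two posts on this site: Using rvest or httr to log in to non-standard forms on a webpage and how to reuse a session to avoid repeated login when scraping with rvest?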
library(rvest)
pgsession <- html_session("https://www.optionslam.com/earnings/stocks/MSFT?page=-1")
pgform <- html_form(pgsession)[[1]]
filled_form <- set_values(pgform, 'username'='user', 'password'='pass')
s <- submit_form(pgsession, filled_form) # s is your logged in session
When I run the code, I get this message:
Submitting with 'NULL'
Warning message:
In request_POST(session, url = url, body = request$values, encode = request$encode, :
Forbidden (HTTP 403).
I also ran the code this way, setting the user_agent as R.S. suggested in the comments, but I get the same error as above.
library(rvest)
library(httr)
uastring <- "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36"
pgsession <- html_session("https://www.optionslam.com/earnings/stocks/MSFT?page=-1", user_agent(uastring))
pgform <- html_form(pgsession)[[1]]
filled_form <- set_values(pgform, 'username'='user', 'password'='pass')
s <- submit_form(pgsession, filled_form) # s is your logged in session
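When a login POST comes back with a 403, one thing worth checking is which hidden inputs the form carries, since those are submitted along with the credentials. The snippet below is a minimal offline sketch, assuming rvest is installed; `hidden_fields` is a hypothetical helper name and the sample form is made up, so whether this particular site uses a hidden token is an assumption to verify.

```r
library(rvest)

# Hypothetical helper: return the hidden inputs of the first form in an
# HTML string as a named character vector (field name -> value).
hidden_fields <- function(html) {
  form <- html_form(read_html(html))[[1]]
  hidden <- Filter(function(f) identical(f$type, "hidden"), form$fields)
  values <- vapply(hidden, function(f) {
    if (is.null(f$value)) NA_character_ else as.character(f$value)
  }, character(1))
  field_names <- vapply(hidden, function(f) {
    if (is.null(f$name)) NA_character_ else as.character(f$name)
  }, character(1))
  setNames(unname(values), unname(field_names))
}

# Made-up login form, used only so the sketch runs without a network call.
sample_form <- '<html><body><form action="/login" method="post">
  <input type="hidden" name="token" value="abc">
  <input type="text" name="username">
  <input type="password" name="password">
</form></body></html>'

h <- hidden_fields(sample_form)
```

Against the real page, the same inspection could be done on `pgform$fields` from the code above.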
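If you pull up the page without logging in, it shows part of the data table at the bottom right, below the text "Available earnings events: 65".
Once logged in, it shows all 65 events, and the table is populated with the content I want to download. I have all the other code I need ready; I am only stuck on the login part.
Thanks for your help.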
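Answer 0 (score: 4)
Following R.S.'s suggestion, I logged in successfully using RSelenium.
A quick note for Mac users working with chrome or phantom: I am running El Capitan and had trouble getting the Mac to recognize the path to the two bin files. Instead, I moved the bin files to /usr/local/bin, and they ran without any problems.
Here is the code to do this (it can also be done with phantom):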
library(RSelenium)
RSelenium::startServer()
remDr <- remoteDriver(browserName = "chrome")
remDr$open()
appURL <- 'https://www.optionslam.com/accounts/login/'
remDr$navigate(appURL)
remDr$findElement("id", "id_username")$sendKeysToElement(list("user"))
remDr$findElement("id", "id_password")$sendKeysToElement(list("password", key='enter'))
appURL <- 'https://www.optionslam.com/earnings/stocks/MSFT?page=-1'
remDr$navigate(appURL)
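Once the browser has navigated to the earnings page, the rendered HTML still has to be handed to a parser to pull out the table. The following is a minimal sketch, assuming rvest is installed; `extract_tables` is a hypothetical helper name, and the sample HTML is made up so the snippet runs without a browser.

```r
library(rvest)

# Hypothetical helper: parse an HTML string and return every <table> in it
# as a list of data frames.
extract_tables <- function(page_source) {
  doc <- read_html(page_source)
  html_table(html_nodes(doc, "table"))
}

# Made-up stand-in for the real page source.
sample_html <- "<html><body><table>
  <tr><th>Date</th><th>Move</th></tr>
  <tr><td>2016-10-20</td><td>4.2</td></tr>
</table></body></html>"

tables <- extract_tables(sample_html)
```

With the live session above, the input would be `remDr$getPageSource()[[1]]`, which returns the page's current HTML as a string. Which element of the resulting list holds the earnings table is something to check by inspection.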
Answer 1 (score: 1)
Here is an answer that solves the problem in the original use case using rvest:
library(rvest)
library(httr)
uastring <- "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36"
pgsession <- html_session("https://www.optionslam.com/earnings/stocks/MSFT?page=-1", user_agent(uastring))
pgform <- html_form(pgsession)[[1]]
filled_form <- set_values(pgform,
                          username = 'un',
                          password = 'ps')
s <- submit_form(pgsession, filled_form, submit = NULL, config(referer = pgsession$url)) # s is your logged in session
The request needs to know the page you came from (the referer [sic]), which is what this argument supplies:
config(referer = pgsession$url)
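The calls above use function names from older rvest releases. In rvest 1.0 and later they were renamed to `session()`, `html_form_set()`, and `session_submit()`. Below is a sketch of the same login flow with the newer names, untested against this site; `login_session` is a hypothetical wrapper, and the field names `username` and `password` are carried over from the code above.

```r
library(rvest)
library(httr)

# Hypothetical wrapper around the login flow, written for rvest >= 1.0.
# Nothing is requested until the function is called.
login_session <- function(url, username, password, ua) {
  # Open a session with a browser-like user agent.
  pgsession <- session(url, user_agent(ua))
  # Take the first form on the page, assumed to be the login form.
  pgform <- html_form(pgsession)[[1]]
  filled_form <- html_form_set(pgform,
                               username = username,
                               password = password)
  # Submit the form, sending the current page as the referer.
  session_submit(pgsession, filled_form, submit = NULL,
                 config(referer = pgsession$url))
}
```

A call would look like `s <- login_session("https://www.optionslam.com/earnings/stocks/MSFT?page=-1", "un", "ps", uastring)`, after which `s` plays the same role as the logged-in session above.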