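I'm trying to use Mechanize to log in to Google Docs so that I can scrape some content (which isn't possible through the API), but I keep getting a 404 when I try to follow the meta redirect: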
require 'rubygems'
require 'mechanize'
USERNAME = "..."
PASSWORD = "..."
LOGIN_URL = "https://www.google.com/accounts/Login?hl=en&continue=http://docs.google.com/"
agent = Mechanize.new
login_page = agent.get(LOGIN_URL)
login_form = login_page.forms.first
login_form.Email = USERNAME
login_form.Passwd = PASSWORD
login_response_page = agent.submit(login_form)
redirect = login_response_page.meta[0].uri.to_s
puts "redirect: #{redirect}"
followed_page = agent.get(redirect) # throws a HTTPNotFound exception
pp followed_page
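Can anyone see why this doesn't work?
Answer 0 (score: 4)
Andy, you're awesome! Your code helped me get my script working and log in to a Google account. After a few hours I found your bug. It's about HTML escaping. As I discovered, Mechanize automatically escapes the URL it receives as the argument to the 'get' method. So my solution is: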
require 'rubygems'
require 'mechanize'
require 'logger'
EMAIL = ".."
PASSWD = ".."
agent = Mechanize.new{ |a| a.log = Logger.new("mech.log")}
agent.user_agent_alias = 'Linux Mozilla'
agent.open_timeout = 3
agent.read_timeout = 4
agent.keep_alive = true
agent.redirect_ok = true
LOGIN_URL = "https://www.google.com/accounts/Login?hl=en"
login_page = agent.get(LOGIN_URL)
login_form = login_page.forms.first
login_form.Email = EMAIL
login_form.Passwd = PASSWD
login_response_page = agent.submit(login_form)
redirect = login_response_page.meta[0].uri.to_s
# Drop the last query parameter (the already-escaped continue) and append a fresh one
new_url = redirect.split('&')[0..-2].join('&') + "&continue=https://www.google.com/adplanner"
puts new_url
followed_page = agent.get(new_url)
pp followed_page
This works fine for me. I replaced the continue parameter from the meta tag (which was already escaped) with a new one.
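
The split/join above assumes that continue is the last parameter in the redirect URL. As a rough sketch of a variant that does not depend on parameter order, the query string can be parsed and rebuilt with Ruby's standard `uri` library. The helper name `replace_continue` and the sample redirect URL are made up for illustration; they are not part of Mechanize or of the code above, and I haven't checked how Mechanize treats the re-encoded URL.

```ruby
require 'uri'

# Hypothetical helper (not part of Mechanize): returns redirect_url with its
# `continue` query parameter replaced by new_target, wherever it appears.
def replace_continue(redirect_url, new_target)
  uri = URI.parse(redirect_url)
  # Decode the query into [key, value] pairs and drop any existing `continue`
  params = URI.decode_www_form(uri.query || '').reject { |key, _| key == 'continue' }
  # Append the new target; encode_www_form percent-encodes it exactly once
  params << ['continue', new_target]
  uri.query = URI.encode_www_form(params)
  uri.to_s
end

# Illustrative redirect URL, invented for this example
sample = "https://www.google.com/accounts/CheckCookie?continue=http%3A%2F%2Fdocs.google.com%2F&chtml=LoginDoneHtml"
puts replace_continue(sample, "https://www.google.com/adplanner")
```

In the script above, the result would be passed to agent.get in place of the hand-built string. Note that this version sends the new continue value percent-encoded, whereas the split/join version appends it raw.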