I am trying to download account transactions (an XML file) from a server. When I enter this URL in a browser:
https://secure.somesite.com:443/my/account/download_transactions.php?type=xml
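it successfully downloads the correct XML file (provided I am already logged in).
I want to do this programmatically in Ruby, and tried the following code: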
require 'open-uri'
require 'rexml/document'
require 'net/http'
require 'net/https'
include REXML
url = URI.parse("https://secure.somesite.com:443/my/account/download_transactions.php?type=xml")
req = Net::HTTP::Get.new(url.path)
req.basic_auth 'userid', 'password'
req.content_type = 'text/xml'
http = Net::HTTP.new(url.host, url.port)
http.use_ssl = true
response = http.start { |http| http.request(req) }
root = Document.new(response.read_body).root
root.elements.each("transaction") do |t|
  id = t.elements["id"].text
  description = t.elements["description"].text
  puts "TRANSACTION ID='#{id}' DESCRIPTION='#{description}'"
end
Execution proceeds, but fails at "Document.new":
RuntimeError: Illegal character '&' in raw string "??ࡱ?;??
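When printed, the returned body is clearly not XML: it is a long run of mostly unreadable characters, with an occasional recognizable word suggesting it is related to the expected content. The string "Arial1" also appears several times amid the unreadable data, which makes me think I am receiving some format other than XML.
My question is: what am I doing wrong here? The XML file is definitely available (and correct, if you inspect the copy the browser fetches). Am I specifying SSL incorrectly? The HTTPS request? Is there a different, proper way to get at the correct body? Thanks in advance for your help!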
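One detail in the code above that may be worth ruling out: url.path holds only the path component, so a request built from it is sent without "?type=xml". Whether that explains the non-XML body is not established here; the following is just a minimal sketch of the difference, using only the standard uri library.

```ruby
require 'uri'

url = URI.parse("https://secure.somesite.com:443/my/account/download_transactions.php?type=xml")

# URI#path is only the path component; the query string is dropped.
path_only = url.path

# URI#request_uri is the path plus "?query", i.e. the full target of a GET request line.
full_target = url.request_uri

puts path_only
puts full_target
```

Passing url.request_uri to Net::HTTP::Get.new keeps the query string in the request.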
Checking the headers was an interesting idea. HttpLiveHeaders shows this for the successful browser sequence:
https://secure.somesite.com/my/account/download_transactions.php?&type=xml
GET /my/account/download_transactions.php?type=xml HTTP/1.1
Host: secure.somesite.com
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Cookie: <obscured>
HTTP/1.x 200 OK
Date: Wed, 21 Oct 2009 13:13:08 GMT
Server: Apache/2.2
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: must-revalidate, post-check=0,pre-check=0
Pragma: public
Content-Disposition: attachment; filename=stuff.xml
Connection: close
Transfer-Encoding: chunked
Content-Type: application/xml
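I tried to match all of the HTTP headers by copying everything from "Accept" onward into my request, but the XML file returned is still garbled.
A hexdump of the response returned to my code shows a lot of 0x00 and 0xFF bytes, with the words "root" and "entry" appearing close to each other. A Wireshark dump of the unsuccessful Ruby sequence is not very useful, because it only shows SSL-encrypted application data. It is clear, though, that a large amount of data is being returned.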
START DUMP
00000000: d0 cf 11 e0 a1 b1 1a e1 - 00 00 00 00 00 00 00 00 ................
00000010: 00 00 00 00 00 00 00 00 - 3b 00 03 00 fe ff 09 00 ........;.......
00000020: 06 00 00 00 00 00 00 00 - 00 00 00 00 01 00 00 00 ................
00000030: 04 00 00 00 00 00 00 00 - 00 10 00 00 00 00 00 00 ................
00000040: 01 00 00 00 fe ff ff ff - 00 00 00 00 05 00 00 00 ................
00000050: ff ff ff ff ff ff ff ff - ff ff ff ff ff ff ff ff ................
00000060: ff ff ff ff ff ff ff ff - ff ff ff ff ff ff ff ff ................
00000070: ff ff ff ff ff ff ff ff - ff ff ff ff ff ff ff ff ................
... and so on... non 00 and FF's appear much further down.
I'm not sure what to try next. Any suggestions?
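The first eight bytes of the dump, d0 cf 11 e0 a1 b1 1a e1, are the signature of an OLE2 Compound File, the container used by legacy Microsoft Office formats such as .xls. That would be consistent with the "Arial1" strings, although nothing in this thread confirms what the server actually sent. A rough sketch of a check on the leading bytes follows; sniff_body is a hypothetical helper name, not part of any library.

```ruby
# Signature that opens every OLE2 Compound File (legacy .doc/.xls/.ppt).
OLE2_MAGIC = "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1".b

# Two-byte signature that opens a gzip stream.
GZIP_MAGIC = "\x1F\x8B".b

# Hypothetical helper: make a rough guess at the body format from its first bytes.
def sniff_body(body)
  bytes = body.to_s.b
  return :ole2_compound_file if bytes.start_with?(OLE2_MAGIC)
  return :gzip if bytes.start_with?(GZIP_MAGIC)
  # Very loose XML test: first non-whitespace byte is "<".
  return :xml if bytes.lstrip.start_with?("<")
  :unknown
end
```

Calling sniff_body(response.body) before Document.new would let the script fail with a clearer message than a parser error.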
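Answer 0 (score: 1)
I fixed this myself. It turns out this particular site does not appear to use Basic authentication; I had to go through its login screen to obtain a usable cookie. I also simplified the solution with "Mechanize", a gem that handles most of the HTTP legwork.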
require 'rubygems'
require 'mechanize'
login_username = "theusername"
login_password = "thepassword"
# get login page
agent = WWW::Mechanize.new
agent.user_agent_alias = 'Mac Safari'
page = agent.get('https://somesite.com/login.php')
# fill out login form and submit
form = page.forms[0] # use first form on page
form['form[username]'] = login_username
form['form[password]'] = login_password
page = agent.submit(form)
# process returned page
if page.uri.to_s.include?("login")
  puts '---- LOGIN FAILED ----'
else
  puts '---- LOGIN SUCCESSFUL ----'
  xml_data = agent.get('https://secure.somesite.com:443/download_transactions.php?type=xml')
  puts xml_data.body
end
What tripped me up was the way the form fields had to be set, which for some reason differed from the examples I had seen.
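For readers who would rather stay with Net::HTTP, the same cookie-based flow might be sketched as below. This is an illustration under assumptions, not what the site requires: it assumes the login form needs only the two fields used above and that the login response sets the session cookie directly. The names cookie_header and fetch_transactions are hypothetical helpers.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: turn the Set-Cookie values of a response into one
# Cookie request header, keeping only each leading "name=value" pair.
# Cookie attributes (Path, Domain, Expires, ...) are ignored.
def cookie_header(set_cookie_values)
  Array(set_cookie_values).map { |value| value[/\A[^;]*/].strip }.join('; ')
end

# Sketch of the flow: POST the login form, then GET the download with the
# cookies from the login response. Redirects and hidden fields are not handled.
def fetch_transactions(login_uri, download_uri, username, password)
  login = Net::HTTP::Post.new(login_uri.request_uri)
  login.set_form_data('form[username]' => username, 'form[password]' => password)
  login_response = Net::HTTP.start(login_uri.host, login_uri.port, use_ssl: true) do |http|
    http.request(login)
  end

  download = Net::HTTP::Get.new(download_uri.request_uri)
  download['Cookie'] = cookie_header(login_response.get_fields('Set-Cookie'))
  Net::HTTP.start(download_uri.host, download_uri.port, use_ssl: true) do |http|
    http.request(download).body
  end
end
```

It would be called with URI.parse applied to the login and download URLs from the code above. Mechanize handles redirects, hidden form fields, and cookie scoping for you, which is a large part of why it is the simpler choice here.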
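Answer 1 (score: 0)
If Ruby could not handle HTTPS, it should throw an exception. At the very least it should. Perhaps the site is compressing the XML and you need to decompress it before parsing? Look at the headers returned when you try to access the XML. If you are using Firefox, try HttpLiveHeaders.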
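To test the compression idea, the body can be decoded according to the response's Content-Encoding header before parsing. The sketch below uses only the standard zlib library; decode_body is a hypothetical helper name. Note that the browser response shown in the question carries no Content-Encoding header, and that Net::HTTP in Ruby 2.0 and later decompresses gzip transparently unless the caller sets Accept-Encoding by hand, so this mainly matters when that header is copied from a browser.

```ruby
require 'zlib'
require 'stringio'

# Hypothetical helper: undo the compression named in a Content-Encoding header.
def decode_body(body, content_encoding)
  case content_encoding.to_s.downcase
  when 'gzip', 'x-gzip'
    Zlib::GzipReader.new(StringIO.new(body)).read
  when 'deflate'
    # Handles the zlib-wrapped form of deflate.
    Zlib::Inflate.inflate(body)
  else
    body # no (or unrecognized) encoding: return the body unchanged
  end
end
```

A typical call would be decode_body(response.body, response['Content-Encoding']), with the result then passed to Document.new.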