Question

I am trying to download my account transactions (an XML file) from a server. When I enter this URL in a browser:

https://secure.somesite.com:443/my/account/download_transactions.php?type=xml

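it successfully downloads the correct XML file (assuming I am already logged in).

I want to do this programmatically in Ruby, and tried this code: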

require 'open-uri'
require 'rexml/document'
require 'net/http' 
require 'net/https'
include REXML

url = URI.parse("https://secure.somesite.com:443/my/account/download_transactions.php?type=xml")
req = Net::HTTP::Get.new(url.path)
req.basic_auth 'userid', 'password'
req.content_type = 'text/xml'

http = Net::HTTP.new(url.host, url.port)
http.use_ssl = true
response = http.start { |http| http.request(req) }

root = Document.new(response.read_body).root

root.elements.each("transaction") do |t|
   id = t.elements["id"].text
   description = t.elements["description"].text
   puts "TRANSACTION ID='#{id}' DESCRIPTION='#{description}'"
end

Execution proceeds, but fails on "Document.new":

RuntimeError: Illegal character '&' in raw string "??ࡱ?;??

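When printed, the returned body is clearly not XML: it is a long run of mostly unreadable characters, with an occasional recognizable word suggesting it is related to the expected content. I also see the string "Arial1" several times among the unreadable bytes, which makes me think I am receiving some format other than XML.

My question is: what am I doing wrong here? The XML file is definitely available (and correct, if you inspect the copy the browser fetches). Am I setting up the SSL/HTTPS request incorrectly? Is there a different, correct way to get at the real body? Thanks in advance for your help!

Checking the headers was an interesting idea. HttpLiveHeaders shows this for the successful browser sequence: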
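One detail in the code above that may be worth checking: URI#path returns only the path component, so Net::HTTP::Get.new(url.path) builds a request without "?type=xml". Whether that is why the server answered with a different format is an assumption; the thread does not confirm it. A minimal sketch of the difference, where build_get is a hypothetical helper name and nothing is sent over the network:

```ruby
require 'uri'
require 'net/http'

# Hypothetical helper: builds (but does not send) a GET request whose
# target keeps the query string.
def build_get(url_string)
  url = URI.parse(url_string)
  Net::HTTP::Get.new(url.request_uri)
end

url = URI.parse('https://secure.somesite.com:443/my/account/download_transactions.php?type=xml')

puts url.path                  # path only; the query string is dropped
puts url.request_uri           # path plus "?type=xml"
puts build_get(url.to_s).path  # the target the request would actually use
```

Passing url.request_uri instead of url.path keeps the type=xml parameter on the request line.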

https://secure.somesite.com/my/account/download_transactions.php?&type=xml

GET /my/account/download_transactions.php?type=xml HTTP/1.1
Host: secure.somesite.com
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Cookie: <obscured>

HTTP/1.x 200 OK
Date: Wed, 21 Oct 2009 13:13:08 GMT
Server: Apache/2.2
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: must-revalidate, post-check=0,pre-check=0
Pragma: public
Content-Disposition: attachment; filename=stuff.xml
Connection: close
Transfer-Encoding: chunked
Content-Type: application/xml

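I tried to match all the HTTP header bits by cutting and pasting the "Accept" headers above into my request, but the returned file is still messed up.

A hexdump of the response returned to my code shows a lot of 00s and FFs, with the words "root" and "entry" near each other. A Wireshark dump of the unsuccessful Ruby sequence is not much help, since it shows only SSL-encrypted application data. It is clear, though, that a large amount of data is being returned.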

START DUMP
00000000: d0 cf 11 e0 a1 b1 1a e1 - 00 00 00 00 00 00 00 00  ................
00000010: 00 00 00 00 00 00 00 00 - 3b 00 03 00 fe ff 09 00  ........;.......
00000020: 06 00 00 00 00 00 00 00 - 00 00 00 00 01 00 00 00  ................
00000030: 04 00 00 00 00 00 00 00 - 00 10 00 00 00 00 00 00  ................
00000040: 01 00 00 00 fe ff ff ff - 00 00 00 00 05 00 00 00  ................
00000050: ff ff ff ff ff ff ff ff - ff ff ff ff ff ff ff ff  ................
00000060: ff ff ff ff ff ff ff ff - ff ff ff ff ff ff ff ff  ................
00000070: ff ff ff ff ff ff ff ff - ff ff ff ff ff ff ff ff  ................
... and so on... non 00 and FF's appear much further down.

I'm not sure what to try next. Any suggestions?
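For reference, the first eight bytes of the dump above (d0 cf 11 e0 a1 b1 1a e1) match the signature of the OLE2 Compound File format used by legacy Microsoft Office files such as .xls. That would be consistent with the "Arial1" and "root"/"entry" strings, though the thread never confirms what the file was. A rough sketch of sniffing a body before handing it to REXML; sniff_format and its return symbols are hypothetical names, and the checks are deliberately simplistic:

```ruby
# Signature that opens an OLE2 Compound File (legacy .xls, .doc, ...).
OLE2_MAGIC = "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1".b
# First two bytes of a gzip stream.
GZIP_MAGIC = "\x1F\x8B".b

# Hypothetical helper: make a rough guess at what a response body contains.
def sniff_format(body)
  bytes = body.to_s.b
  return :ole2_compound_file if bytes.start_with?(OLE2_MAGIC)
  return :gzip if bytes.start_with?(GZIP_MAGIC)
  return :xml if bytes.lstrip.start_with?('<')
  :unknown
end

# The first bytes of the dump from the question.
dump_head = "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1\x00\x00\x00\x00".b
puts sniff_format(dump_head)
puts sniff_format('<?xml version="1.0"?><root/>')
```

A check like this turns the confusing REXML parse error into an early, explicit "this is not XML" result.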

Answer 1

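I fixed this myself. It turns out this particular site does not appear to use Basic authentication; I had to go through its login form to obtain a usable cookie. I also simplified the solution by using Mechanize, a gem that handles most of the HTTP work.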

require 'rubygems'
require 'mechanize'

login_username = "theusername"
login_password = "thepassword"

# get login page
agent = Mechanize.new # Mechanize releases before 1.0 used WWW::Mechanize
agent.user_agent_alias = 'Mac Safari'
page = agent.get('https://somesite.com/login.php')

# fill out login form and submit
form = page.forms[0] # use first form on page
form['form[username]'] = login_username
form['form[password]'] = login_password
page = agent.submit(form)

# process returned page 
if page.uri.to_s.include?("login") 
  puts '---- LOGIN FAILED ----'
else
  puts '---- LOGIN SUCCESSFUL ----'
  xml_data = agent.get('https://secure.somesite.com:443/download_transactions.php?type=xml')
  puts xml_data.body
end

What tripped me up was the way the form fields had to be set, which for some reason differed from the examples I had seen.
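Once the body really is XML, the loop from the question can be applied to it. A small sketch on a made-up sample: the element names "transaction", "id" and "description" come from the question's code, while the root element name, the sample values and the parse_transactions helper are assumptions for illustration only.

```ruby
require 'rexml/document'

# Hypothetical sample; the structure of the real file is not shown in the thread.
SAMPLE_XML = <<~XML
  <?xml version="1.0"?>
  <transactions>
    <transaction><id>1</id><description>Coffee</description></transaction>
    <transaction><id>2</id><description>Books</description></transaction>
  </transactions>
XML

# Returns an array of [id, description] pairs from an XML string.
def parse_transactions(xml)
  root = REXML::Document.new(xml).root
  root.get_elements('transaction').map do |t|
    [t.elements['id'].text, t.elements['description'].text]
  end
end

parse_transactions(SAMPLE_XML).each do |id, description|
  puts "TRANSACTION ID='#{id}' DESCRIPTION='#{description}'"
end
```

With the Mechanize code above, the same helper could be called as parse_transactions(xml_data.body).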

Answer 2

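If Ruby couldn't handle HTTPS, it should throw an exception, at the very least. Maybe the site is compressing the XML and you need to decompress it before parsing? Look at the headers returned when you try to access the XML. If you are using Firefox, try HttpLiveHeaders.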
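As a sketch of the decompression idea, assuming the response carries a Content-Encoding header: decode_body below is a hypothetical helper, not part of Net::HTTP. As far as I know, recent Net::HTTP versions decompress gzip on their own unless Accept-Encoding is set by hand, so this would mainly matter when that header has been copied from the browser.

```ruby
require 'zlib'
require 'stringio'

# Hypothetical helper: undo the Content-Encoding of a response body, if any.
def decode_body(body, content_encoding)
  case content_encoding.to_s.downcase
  when 'gzip', 'x-gzip'
    Zlib::GzipReader.new(StringIO.new(body)).read
  when 'deflate'
    # Assumes zlib-wrapped deflate; some servers send raw deflate instead.
    Zlib::Inflate.inflate(body)
  else
    body # no encoding, or one this sketch does not handle
  end
end

compressed = Zlib.gzip('<root><transaction/></root>')
puts decode_body(compressed, 'gzip')
```

With the code from the question, the call would look like decode_body(response.body, response['Content-Encoding']).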

Unable to get XML data with Ruby over HTTPS