How to parse a form from a string using Mechanize or Nokogiri

Date: 2017-04-12 08:19:45

Tags: ruby nokogiri mechanize

I need to parse a form to get the value of IW_SessionID_ from the HTML I receive, but I can't get it to work.

#!/usr/bin/ruby

require 'pp'
require 'nokogiri'
require 'mechanize'

r = '<HTML><HEAD><TITLE></TITLE><meta http-equiv=\"cache-control\" content=\"no-cache\">\r\n<meta http-equiv=\"pragma\" content=\"no-cache\">\r\n<NOSCRIPT><HTML><BODY>Your browser does not seem to support JavaScript. Please make sure it is supported and activated</BODY></HTML></NOSCRIPT>\r\n<SCRIPT>\r\nvar ie4 = (document.all)? true:false;\r\nvar ns6 = (document.getElementById)? true && !ie4:false;\r\nfunction Initialize() {\r\nvar lWidth;\r\nvar lHeight;\r\nif (ns6) {\r\n  lWidth = window.innerWidth - 30;\r\n  lHeight = window.innerHeight - 30;\r\n} else {\r\n   lWidth = document.body.clientWidth;\r\n   lHeight = document.body.clientHeight;\r\n   if (lWidth == 0) { lWidth = undefined;}\r\n   if (lHeight == 0) { lHeight = undefined;}\r\n}\r\ndocument.forms[0].elements[\"IW_width\"].value = lWidth;\r\ndocument.forms[0].elements[\"IW_height\"].value = lHeight;\r\ndocument.forms[0].submit();\r\n}</SCRIPT></HEAD><BODY onload=\"Initialize()\">\r\n<form method=post action=\"/bwtem\">\r\n<input type=hidden name=\"IW_width\">\r\n<input type=hidden name=\"IW_height\">\r\n<input type=hidden name=\"IW_SessionID_\" value=\"1wqzj1f0vec57r1apfqg51wzs88c\">\r\n<input type=hidden name=\"IW_TrackID_\" value=\"0\">\r\n</form></BODY></HTML>'

page = Nokogiri::HTML r
puts page.css('form[name="IW_SessionID_"]')

a = Mechanize.new
page2 = Mechanize::Page.new(nil,{'content-type'=>'text/html'},r,nil,a)

pp page2.form_with(:name => "IW_SessionID_")

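The script only returns nil.

Can anyone figure out how to get the value of IW_SessionID_?

2 answers:

Answer 0: (score: 0)

You have to parse the sample HTML string and then search for the input named IW_SessionID_, not for a form with that name.

This sample code works for me: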

#!/usr/bin/ruby

require 'pp'
require 'nokogiri'
require 'mechanize'

r = '<HTML><HEAD><TITLE></TITLE><meta http-equiv="cache-control" content="no-cache">\r\n<meta http-equiv="pragma" content="no-cache">\r\n<NOSCRIPT><HTML><BODY>Your browser does not seem to support JavaScript. Please make sure it is supported and activated</BODY></HTML></NOSCRIPT>\r\n<SCRIPT>\r\nvar ie4 = (document.all)? true:false;\r\nvar ns6 = (document.getElementById)? true && !ie4:false;\r\nfunction Initialize() {\r\nvar lWidth;\r\nvar lHeight;\r\nif (ns6) {\r\n  lWidth = window.innerWidth - 30;\r\n  lHeight = window.innerHeight - 30;\r\n} else {\r\n   lWidth = document.body.clientWidth;\r\n   lHeight = document.body.clientHeight;\r\n   if (lWidth == 0) { lWidth = undefined;}\r\n   if (lHeight == 0) { lHeight = undefined;}\r\n}\r\ndocument.forms[0].elements["IW_width"].value = lWidth;\r\ndocument.forms[0].elements["IW_height"].value = lHeight;\r\ndocument.forms[0].submit();\r\n}</SCRIPT></HEAD><BODY onload="Initialize()">\r\n<form method=post action="/bwtem">\r\n<input type=hidden name="IW_width">\r\n<input type=hidden name="IW_height">\r\n<input type=hidden name="IW_SessionID_" value="1wqzj1f0vec57r1apfqg51wzs88c">\r\n<input type=hidden name="IW_TrackID_" value="0">\r\n</form></BODY></HTML>'

page = Nokogiri::HTML r
input = page.css('input[name="IW_SessionID_"]').first
puts input[:value]

Answer 1: (score: 0)

Once you're familiar with the tools, this is easy to do:

require 'nokogiri'

doc = Nokogiri::HTML(DATA.read)

doc.at('input[name="IW_SessionID_"]')['value']
# => "1wqzj1f0vec57r1apfqg51wzs88c"

__END__
<HTML>
  <BODY>
    <form method=post action="/bwtem">
      <input type=hidden name="IW_height">
      <input type=hidden name="IW_SessionID_" value="1wqzj1f0vec57r1apfqg51wzs88c">
      <input type=hidden name="IW_TrackID_" value="0">
    </form>
  </BODY>
</HTML>

Don't do the following:

page.css('form[name="IW_SessionID_"]')

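css is meant for finding multiple elements that match a selector. A form is unlikely to have several hidden inputs with the same name, so at is the more sensible choice. css returns a NodeSet, which is like an array of nodes and so does not behave like a single Node: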

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
  <body>
    <p>foo</p>
    <p>bar</p>
  </body>
</html>
EOT

doc.search('p').class # => Nokogiri::XML::NodeSet
doc.at('p').class # => Nokogiri::XML::Element

Calling text on a NodeSet concatenates the text of all its nodes, which leads to confusion:

doc.search('p').text # => "foobar"

whereas map(&:text) iterates over the nodes and returns the text of each one:

doc.search('p').map(&:text) # => ["foo", "bar"]

Also note that css(...).first and search(...).first are the same as at or one of its at_* siblings:

doc.search('p').first.to_html # => "<p>foo</p>"
doc.at('p').to_html # => "<p>foo</p>"

For clarity, use at instead of search(...).first.

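Finally, strip your HTML sample down to the bare minimum that demonstrates the problem you're asking about. Anything beyond that wastes space and time for those of us trying to understand the question.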