Question

I'm trying to parse an HTML table using Hpricot, but I'm stuck: I can't select a table element with a given ID from the page.

Here is my Ruby code:

require 'rubygems'
require 'mechanize'
require 'hpricot'

agent = WWW::Mechanize.new

page = agent.get('http://www.indiapost.gov.in/pin/pinsearch.aspx')

form = page.forms.find {|f| f.name == 'form1'}
form.fields.find {|f| f.name == 'ddl_state'}.options[1].select
page = agent.submit(form, form.buttons[2])

doc = Hpricot(page.body)

puts doc.to_html # Here the doc contains the full HTML page

puts doc.search("//table[@id='gvw_offices']").first # This is NIL

Can anyone help me figure out what the problem is?

Answer 1

Mechanize uses Hpricot internally (it is Mechanize's default parser). What's more, it passes Hpricot methods through to the parser, so you don't have to do the parsing yourself:

require 'rubygems'
require 'mechanize'

#You don't really need this if you don't use hpricot directly
require 'hpricot'

agent = WWW::Mechanize.new

page = agent.get('http://www.indiapost.gov.in/pin/pinsearch.aspx')

form = page.forms.find {|f| f.name == 'form1'}
form.fields.find {|f| f.name == 'ddl_state'}.options[1].select
page = agent.submit(form, form.buttons[2])

puts page.parser.to_html # page.parser returns the hpricot parser

puts page.at("//table[@id='gvw_offices']") # This passes through to hpricot

Also note that page.search("foo").first is equivalent to page.at("foo").

Answer 2

Note that later versions of Mechanize (0.9.0) no longer use Hpricot by default (they use Nokogiri), so you have to specify Hpricot explicitly to keep using it:

  WWW::Mechanize.html_parser = Hpricot

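Just like that, with no quotes or anything else around Hpricot. Presumably it names the Hpricot module itself, because the statement won't work if you put it inside your own module declaration. The best place for it is at the top of your file, before you open any module or class: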

require 'mechanize'
require 'hpricot'

# Later versions of Mechanize no longer use Hpricot by default
# but have an attribute we can set to use it
begin
  WWW::Mechanize.html_parser = Hpricot
rescue NoMethodError
  # must be using an older version of Mechanize that doesn't
  # have the html_parser attribute - just ignore it since 
  # this older version will use Hpricot anyway
end

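By using a rescue block, you make sure that users with an older version of Mechanize won't get an error on the nonexistent html_parser attribute. (Otherwise you would need to make your code depend on the latest version of Mechanize.)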
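As an alternative to rescuing NoMethodError, you could check for the setter before calling it. Here is a minimal sketch of that idea. OldAgent, NewAgent and use_parser are hypothetical stand-ins I made up so the snippet runs without Mechanize installed; only the respond_to? check is the point.

```ruby
# Hypothetical stand-ins for an old and a new Mechanize class.
# These are NOT the real library; they only mimic whether the
# html_parser attribute exists.
class OldAgent
end

class NewAgent
  class << self
    attr_accessor :html_parser
  end
end

# Set the parser only when the class exposes the setter.
# Returns true if the parser was set, false otherwise.
def use_parser(agent_class, parser)
  return false unless agent_class.respond_to?(:html_parser=)

  agent_class.html_parser = parser
  true
end

use_parser(NewAgent, :hpricot) # sets NewAgent.html_parser
use_parser(OldAgent, :hpricot) # does nothing and raises no error
```

Whether you prefer this over begin/rescue is mostly taste. The check has the small advantage that it cannot hide a NoMethodError raised from somewhere else.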

Also, in the latest version WWW::Mechanize::List has been deprecated. Don't ask me why, because it completely breaks backward compatibility for statements such as

page.forms.name('form1').first

This used to be a common idiom, because Page#forms returned a Mechanize List that had a "name" method. Now it returns a plain array of Forms.

I found this out the hard way, but your usage will still work because you use find, which is an Array method.
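To show why find keeps working on a plain array, here is a small sketch that needs no Mechanize at all. Form is a hypothetical Struct standing in for a Mechanize form object, and the form names are made up.

```ruby
# Hypothetical stand-in for a Mechanize form; only the name matters here.
Form = Struct.new(:name)

forms = [Form.new('login'), Form.new('form1'), Form.new('form1')]

# Enumerable#find returns the first element for which the block is true,
# or nil when nothing matches.
form = forms.find { |f| f.name == 'form1' }
missing = forms.find { |f| f.name == 'no_such_form' }
```

Since find can return nil, it is worth checking the result before calling methods on it.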

But a better way to find the first form with a given name is Page#form, so your form lookup line becomes

form = page.form('form1')

This method also works in older versions.

Parsing an HTML table using Hpricot (Ruby)
