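Extract the first 2 table cells from every table row with Nokogiri

Time: 2012-02-13 14:48:11

Tags: ruby xpath nokogiri

I have a table and want to use Nokogiri to extract the contents of the first two cells in every table row. I'm running into some trouble and would appreciate help. This is what I have so far. Can anyone help? Thanks.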

irb(main):001:0> require 'nokogiri'
=> true
irb(main):002:0>
irb(main):003:0* @doc = Nokogiri::HTML::DocumentFragment.parse <<-EOHTML
irb(main):004:0" <body>
irb(main):005:0" <div class="c">
irb(main):006:0" <table>
irb(main):007:0"     <tr>
irb(main):008:0"         <td>test</td><td>test</td><td>test</td><td>test</td>
irb(main):009:0"     </tr>
irb(main):010:0"     <tr class="even">
irb(main):011:0"         <td>test</td><td>test</td><td>test</td><td>test</td>
irb(main):012:0"     </tr>
irb(main):013:0"     <tr>
irb(main):014:0"         <td>test</td><td>test</td><td>test</td><td>test</td>
irb(main):015:0"     </tr>
irb(main):016:0"     <tr class="even">
irb(main):017:0"         <td>test</td><td>test</td><td>test</td><td>test</td>
irb(main):018:0"     </tr>
irb(main):019:0" </table>
irb(main):020:0" </div>
irb(main):021:0" </body>
irb(main):022:0" EOHTML
irb(main):026:0> @doc.css("div.c > table").search("table/tr/td")
=> ...
irb(main):026:0> @doc.css("div.c > table").search("table/tr/td[position()>2]")
Nokogiri::CSS::SyntaxError: unexpected '>' after '#<Nokogiri::CSS::Node:0x2b7bc20>'
        from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0-x86-mingw32/lib/nokogiri/css/parser_extras.rb:87:in `on_error'
        from C:/RailsInstaller/Ruby1.9.2/lib/ruby/1.9.1/racc/parser.rb:99:in `_racc_do_parse_c'
        from C:/RailsInstaller/Ruby1.9.2/lib/ruby/1.9.1/racc/parser.rb:99:in `do_parse'
        from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0-x86-mingw32/lib/nokogiri/css/parser_extras.rb:62:in `parse'
        from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0-x86-mingw32/lib/nokogiri/css/parser_extras.rb:79:in `xpath_for'
        from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0-x86-mingw32/lib/nokogiri/css.rb:23:in `xpath_for'
        from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0-x86-mingw32/lib/nokogiri/xml/node_set.rb:111:in `block (2 levels) in css'
        from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0-x86-mingw32/lib/nokogiri/xml/node_set.rb:109:in `map'
        from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0-x86-mingw32/lib/nokogiri/xml/node_set.rb:109:in `block in css'
        from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0-x86-mingw32/lib/nokogiri/xml/node_set.rb:239:in `block in each'
        from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0-x86-mingw32/lib/nokogiri/xml/node_set.rb:238:in `upto'
        from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0-x86-mingw32/lib/nokogiri/xml/node_set.rb:238:in `each'
        from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0-x86-mingw32/lib/nokogiri/xml/node_set.rb:105:in `css'
        from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0-x86-mingw32/lib/nokogiri/xml/node_set.rb:83:in `block in search'
        from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0-x86-mingw32/lib/nokogiri/xml/node_set.rb:80:in `each'
        from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0-x86-mingw32/lib/nokogiri/xml/node_set.rb:80:in `search'
        from (irb):27
        from C:/RailsInstaller/Ruby1.9.2/bin/irb:12:in `<main>'
irb(main):028:0>

3 Answers:

Answer 0: (score: 3)

Use an XPath query:

@doc.xpath('//table/tr/td[1] | //table/tr/td[2]')

This returns the first and second td nodes of every tr node that is a direct child of a table node.

Answer 1: (score: 3)

Since you said in the comments that you want to preserve the logical relationship between cells in the same row:

@doc.css('div.c > table > tr').each do |tr|
  td1, td2 = tr.xpath('./td') # Find only direct child items
  # td1 is the first <td>, td2 the second
end

If you want to extract all of the text efficiently in one pass:

data = @doc.css('tr').map do |row|
  # Find the text for all td, get the first two, then join with ' - '
  row.xpath('./td').map(&:text)[0,2].join(' - ')
end

puts data
#=> a1 - b1
#=> a2 - b2
#=> a3 - b3
#=> a4 - b4

The output above comes from test data that is more interesting than a table where every cell says "test".

Answer 2: (score: 1)

I suggest using a SAX parser:

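require 'nokogiri'

class ShowtimeDaily < Nokogiri::XML::SAX::Document
  attr_reader :td_count

  def start_element(name, attrs = [])
    case name
    when 'tr'
      @td_count = 0    # a new row starts: reset the cell counter
    when 'td'
      @td_count += 1   # number of the current cell within its row
      @in_td = true
    end
  end

  def end_element(name)
    @in_td = false if name == 'td'
  end

  def characters(string)
    # string contains the text content you need; skip text outside <td>
    puts "content of cell number #{@td_count}: #{string}" if @in_td && @td_count < 3
  end
end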

The code I wrote may well contain bugs, since I haven't verified it. I hope it solves your problem.