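I have a table and want to use Nokogiri to extract the contents of the first two cells in each table row. I'm stuck and would appreciate some help. This is what I have so far. Can anyone help me? Thanks.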
irb(main):001:0> require 'nokogiri'
=> true
irb(main):002:0>
irb(main):003:0* @doc = Nokogiri::HTML::DocumentFragment.parse <<-EOHTML
irb(main):004:0" <body>
irb(main):005:0" <div class="c">
irb(main):006:0" <table>
irb(main):007:0" <tr>
irb(main):008:0" <td>test</td><td>test</td><td>test</td><td>test</td>
irb(main):009:0" </tr>
irb(main):010:0" <tr class="even">
irb(main):011:0" <td>test</td><td>test</td><td>test</td><td>test</td>
irb(main):012:0" </tr>
irb(main):013:0" <tr>
irb(main):014:0" <td>test</td><td>test</td><td>test</td><td>test</td>
irb(main):015:0" </tr>
irb(main):016:0" <tr class="even">
irb(main):017:0" <td>test</td><td>test</td><td>test</td><td>test</td>
irb(main):018:0" </tr>
irb(main):019:0" </table>
irb(main):020:0" </div>
irb(main):021:0" </body>
irb(main):022:0" EOHTML
irb(main):026:0> @doc.css("div.c > table").search("table/tr/td")
=> ...
irb(main):026:0> @doc.css("div.c > table").search("table/tr/td[position()>2]")
Nokogiri::CSS::SyntaxError: unexpected '>' after '#<Nokogiri::CSS::Node:0x2b7bc20>'
from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0-x86-mingw32/lib/nokogiri/css/parser_extras.rb:87:in `on_error'
from C:/RailsInstaller/Ruby1.9.2/lib/ruby/1.9.1/racc/parser.rb:99:in `_racc_do_parse_c'
from C:/RailsInstaller/Ruby1.9.2/lib/ruby/1.9.1/racc/parser.rb:99:in `do_parse'
from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0-x86-mingw32/lib/nokogiri/css/parser_extras.rb:62:in `parse'
from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0-x86-mingw32/lib/nokogiri/css/parser_extras.rb:79:in `xpath_for'
from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0-x86-mingw32/lib/nokogiri/css.rb:23:in `xpath_for'
from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0-x86-mingw32/lib/nokogiri/xml/node_set.rb:111:in `block (2 levels) in css'
from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0-x86-mingw32/lib/nokogiri/xml/node_set.rb:109:in `map'
from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0-x86-mingw32/lib/nokogiri/xml/node_set.rb:109:in `block in css'
from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0-x86-mingw32/lib/nokogiri/xml/node_set.rb:239:in `block in each'
from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0-x86-mingw32/lib/nokogiri/xml/node_set.rb:238:in `upto'
from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0-x86-mingw32/lib/nokogiri/xml/node_set.rb:238:in `each'
from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0-x86-mingw32/lib/nokogiri/xml/node_set.rb:105:in `css'
from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0-x86-mingw32/lib/nokogiri/xml/node_set.rb:83:in `block in search'
from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0-x86-mingw32/lib/nokogiri/xml/node_set.rb:80:in `each'
from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0-x86-mingw32/lib/nokogiri/xml/node_set.rb:80:in `search'
from (irb):27
from C:/RailsInstaller/Ruby1.9.2/bin/irb:12:in `<main>'
irb(main):028:0>
Answer 0 (score: 3)
Use an XPath query:
@doc.xpath('//table/tr/td[1] | //table/tr/td[2]')
This returns the first and second td nodes of every tr node whose parent is a table node.
Answer 1 (score: 3)
Since you said in the comments that you want to preserve the logical relationship between cells in the same row:
@doc.css('div.c > table > tr').each do |tr|
  td1, td2 = tr.xpath('./td') # Select only the direct child <td> elements
  # td1 is the first <td>, td2 the second
end
If you want to extract all the text efficiently in one pass:
data = @doc.css('tr').map do |row|
  # Get the text of every td, keep the first two, then join them with ' - '
  row.xpath('./td').map(&:text)[0,2].join(' - ')
end
puts data
#=> a1 - b1
#=> a2 - b2
#=> a3 - b3
#=> a4 - b4
The output above comes from test data that is more interesting than cells that all read "test".
Answer 2 (score: 1)
I would suggest using the SAX parser:
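require 'nokogiri'

class ShowtimeDaily < Nokogiri::XML::SAX::Document
  attr_reader :td_count

  def initialize
    @td_count = 0
  end

  def start_element(name, attrs = [])
    case name
    when 'tr'
      @td_count = 0 # new row: reset the cell counter
    when 'td'
      @td_count += 1
    end
  end

  def characters(string)
    # string holds a chunk of text content; skip whitespace-only chunks
    return if string.strip.empty?
    puts "content of cell number #{@td_count}: #{string}" if @td_count.between?(1, 2)
  end
end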
The code I wrote may well contain errors, since I have not verified it. I hope it helps solve your problem.