Scraping multiple sibling table rows with Nokogiri

Time: 2015-03-15 01:36:35

Tags: ruby nokogiri

I am trying to parse a table with the following markup.

<table>
  <tr class="athlete">
    <td colspan="2" class="name">Alex</td>
  </tr>
  <tr class="run">
    <td>5.00</td>
    <td>10.00</td>
  </tr>
  <tr class="run">
    <td>5.20</td>
    <td>10.50</td>
  </tr>
  <tr class="end"></tr>
  <tr class="athlete">
    <td colspan="2" class="name">John</td>
  </tr>
  <tr class="run">
    <td>5.00</td>
    <td>10.00</td>
  </tr>
  <tr class="end"></tr>
</table>

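I need to loop through each .athlete table row and collect every sibling .run row that follows it, until I reach the .end row. Then I repeat that for the next athlete, and so on. Some .athlete rows have two .run rows; others have one.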

Here is what I have so far. I loop through the athletes:

require 'rubygems'
require 'nokogiri'
require 'open-uri'

url = "http://myurl.com"
doc = Nokogiri::HTML(URI.open(url)) # Kernel#open no longer accepts URLs in Ruby 3.0+

doc.css(".athlete").each do |athlete|
  puts athlete.at_css(".name").text # ".name" selects by class; "name" would look for a <name> element
  # Loop through the sibling .run rows until I reach the .end row
  # output the value of the td's in the .run row
end

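I can't figure out how to get each sibling .run row and stop at the .end row. I feel this would be easier if the table were better formed, but unfortunately I have no control over the markup. Any help would be greatly appreciated!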

2 answers:

Answer 0: (score: 0)

I would process the table as follows:

  1. Find the table to process

    table = doc.at_css("table")
    
  2. Get all the rows that are direct children of the table

    rows = table.css("> tr")
    
  3. Group the rows bounded by .athlete and .end

    grouped = [[]]
    rows.each do |row|
      if row['class'] == 'athlete' and grouped.last.empty?
        grouped.last << row
      elsif row['class'] == 'end' and not grouped.last.empty?
        grouped.last << row
        grouped << []
      elsif not grouped.last.empty?
        grouped.last << row
      end
    end
    grouped.pop if grouped.last.empty? || grouped.last.last['class'] != 'end'
    
  4. Process the grouped rows

    grouped.each do |group|
      puts "BEGIN: >> #{group.first.text} <<"
      group[1..-2].each do |row|
        # Collapse whitespace only; a bare squeeze would also turn "5.00" into "5.0"
        puts "  #{row.text.split.join(' ')}"
      end
      puts "END: >> #{group.last.text} <<"
    end
    

Answer 1: (score: 0)

require 'nokogiri'

doc = <<DOC
<table>
  <tr class="athlete">
    <td colspan="2" class="name">Alex</td>
  </tr>
  <tr class="run">
    <td>5.00</td>
    <td>10.00</td>
  </tr>
  <tr class="run">
    <td>5.20</td>
    <td>10.50</td>
  </tr>
  <tr class="end"></tr>
  <tr class="athlete">
    <td colspan="2" class="name">John</td>
  </tr>
  <tr class="run">
    <td>5.00</td>
    <td>10.00</td>
  </tr>
  <tr class="end"></tr>
</table>
DOC

doc = Nokogiri::HTML(doc)
# You can exclude .end if it is always empty and not needed
trs = doc.css('.athlete, .run, .end').to_a
# This will return [['athlete', 'run', ...,'end'], ['athlete', 'run', ...,'end'] ...]
athletes = trs.slice_before{ |elm| elm.attr('class') =='athlete' }.to_a

athletes.map! do |athlete|
    {
        name: athlete.shift.at_css('.name').text,
        runs: athlete
        .select{ |tr| tr.attr('class') == 'run' }
        .map{|run| run.text.to_f }
    }
end

puts athletes.inspect
#[{:name=>"Alex", :runs=>[5.0, 5.2]}, {:name=>"John", :runs=>[5.0]}]