How can I get the HTML between a pair of identical tags using Nokogiri?

Asked: 2015-02-21 17:16:34

Tags: html ruby xpath nokogiri

I want to scrape an HTML file like this:

<div id="hoge">
  <h1><span>title 1</span></h1>

    <h2><span>subtitle 1-1</span></h2>
    <p></p>
    <table class="fuga"><span>data 1-1</span></table>
    <p></p>

    //(the same structure repeated n times)

    <h2><span>subtitle 1-(n+2)<span/></h2>
    <p></p>
    <table class="fuga"><span>data 1-(n+2)</span></table>
    <p></p>


  //(the same structure repeated m times)

  <h1><span>title m</span></h1>

    <h2><span>subtitle m-1</span></h2>
    <p></p>
    <table class="fuga"><span>data m-1</span></table>
    <p></p>

    //(the same structure repeated l times)

    <h2><span>subtitle m-(l+2)</span></h2>
    <p></p>
    <table class="fuga"><span>data m-(l+2)</span></table>
    <p></p>


</div>

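I need the table value (data x-y) for each subtitle ("subtitle x-y"), grouped under its title ("title x"). To associate them, I want to cut out everything from one <h1> through the last <p> before the next <h1>, but I can't figure out how to do it.
I spent 5 hours searching, reading, and going through trial and error, and finally wrote the code below, but it still doesn't work.
What's wrong? How can I cut up the HTML?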

require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(open("http://example.com/"))

doc.xpath('//div[@id="mw-content-text"]').each do |node|
  for i in 1..node.xpath('h1').length do
    mininode = node.xpath(%(node()[not(following-sibling::h1[#{i}] or preceding-sibling::h1[#{i+1}])]))

    title = mininode.xpath('h1/span').text
    puts title unless title.empty?
    puts "============"

    for j in 1..mininode.xpath('h2').length do
      puts mininode.xpath(%(h2[#{j}]/span)).text
      puts mininode.xpath(%(table[#{j}]/span)).text
    end
  end
end

1 Answer:

Answer 0 (score: 1)

Consider this:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<div id="hoge">
  <h1><span>title 1</span></h1>

    <h2><span>subtitle 1-1</span></h2>
    <p></p>
    <table class="fuga"><span>data 1-1</span></table>
    <p></p>

    //(the same structure repeated n times)

    <h2><span>subtitle 1-(n+2)<span/></h2>
    <p></p>
    <table class="fuga"><span>data 1-(n+2)</span></table>
    <p></p>


  //(the same structure repeated m times)

  <h1><span>title m</span></h1>

    <h2><span>subtitle m-1</span></h2>
    <p></p>
    <table class="fuga"><span>data m-1</span></table>
    <p></p>

    //(the same structure repeated l times)

    <h2><span>subtitle m-(l+2)</span></h2>
    <p></p>
    <table class="fuga"><span>data m-(l+2)</span></table>
    <p></p>


</div>
EOT

Processing doc:

div = doc.at('#hoge')
h1_blocks = div.children.slice_before{ |node| node.name == 'h1' }.map{ |nodes| Nokogiri::XML::NodeSet.new(doc, nodes) }

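Running that leaves h1_blocks holding an array of NodeSets. h1_blocks[0] contains only the whitespace that precedes the first <h1>, so the first real group is at index 1. Here it is, based on your HTML: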

h1_blocks[1].map(&:to_html)
# => ["<h1><span>title 1</span></h1>",
#     "\n\n    ",
#     "<h2><span>subtitle 1-1</span></h2>",
#     "\n    ",
#     "<p></p>",
#     "\n    ",
#     "<table class=\"fuga\"><span>data 1-1</span></table>",
#     "\n    ",
#     "<p></p>",
#     "\n\n    //(the same structure repeated n times)\n\n    ",
#     "<h2><span>subtitle 1-(n+2)<span></span></span></h2>",
#     "\n    ",
#     "<p></p>",
#     "\n    ",
#     "<table class=\"fuga\"><span>data 1-(n+2)</span></table>",
#     "\n    ",
#     "<p></p>",
#     "\n\n\n  //(the same structure repeated m times)\n\n  "]

Here's the second group, based on your HTML:

h1_blocks[2].map(&:to_html)
# => ["<h1><span>title m</span></h1>",
#     "\n\n    ",
#     "<h2><span>subtitle m-1</span></h2>",
#     "\n    ",
#     "<p></p>",
#     "\n    ",
#     "<table class=\"fuga\"><span>data m-1</span></table>",
#     "\n    ",
#     "<p></p>",
#     "\n\n    //(the same structure repeated l times)\n\n    ",
#     "<h2><span>subtitle m-(l+2)</span></h2>",
#     "\n    ",
#     "<p></p>",
#     "\n    ",
#     "<table class=\"fuga\"><span>data m-(l+2)</span></table>",
#     "\n    ",
#     "<p></p>",
#     "\n\n\n"]

How does this work?

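Ruby's Enumerable module has slice_before, which runs a test against each element and, for every true result, starts a new sub-array at that element. This is useful when we have a flat list of elements and need to break it into separate chunks.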

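We often use it when parsing text that contains some sort of repeating blocks and we need to handle each block as a unit, such as paragraphs, network device interfaces, and so on.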
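As a rough illustration of that chunking on plain text, here is a minimal sketch. The lines array is made-up sample data (not from the question), assuming each block begins with a line starting with "interface":

```ruby
# Hypothetical sample data: a flat list of lines where every
# block begins with a line that starts with "interface".
lines = [
  'interface eth0',
  '  address 10.0.0.1',
  '  mtu 1500',
  'interface eth1',
  '  address 10.0.0.2'
]

# slice_before starts a new chunk at each element for which the
# block returns true; to_a turns the enumerator into an array of arrays.
blocks = lines.slice_before { |line| line.start_with?('interface') }.to_a

blocks.each { |block| p block }
```

Each inner array then holds one interface's lines, header first, so it can be handled as a unit.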

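The nodes are chunked by taking the children of the <div id="hoge"> tag and slicing them at each <h1>. The chunks are then passed to map, which converts them back into NodeSets, so they can be processed as usual in Nokogiri.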