Question

I want to scrape an HTML file like this:

<div id="hoge">
  <h1><span>title 1</span></h1>

    <h2><span>subtitle 1-1</span></h2>
    <p></p>
    <table class="fuga"><span>data 1-1</span></table>
    <p></p>

    //(the same structure repeated n times)

    <h2><span>subtitle 1-(n+2)<span/></h2>
    <p></p>
    <table class="fuga"><span>data 1-(n+2)</span></table>
    <p></p>


  //(the same structure repeated m times)

  <h1><span>title m</span></h1>

    <h2><span>subtitle m-1</span></h2>
    <p></p>
    <table class="fuga"><span>data m-1</span></table>
    <p></p>

    //(the same structure repeated l times)

    <h2><span>subtitle m-(l+2)</span></h2>
    <p></p>
    <table class="fuga"><span>data m-(l+2)</span></table>
    <p></p>


</div>

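I need the table value ("data x-y") for each subtitle ("subtitle x-y") under each title ("title x"). To associate them, I want to cut the HTML from each <h1> up to the last <p> before the next <h1>, but I can't figure out how to do it.
I spent 5 hours searching, reading, and going through trial and error, and finally wrote the code below, but it still doesn't work.
What's wrong? How can I cut the HTML?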

require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(open("http://example.com/"))

doc.xpath('//div[@id="mw-content-text"]').each do |node|
  for i in 1..node.xpath('h1').length do
    mininode = node.xpath(%(node()[not(following-sibling::h1[#{i}] or preceding-sibling::h1[#{i+1}])]))

    title = mininode.xpath('h1/span').text
    puts title unless title.empty?
    puts "============"

    for j in 1..mininode.xpath('h2').length do
      puts mininode.xpath(%(h2[#{j}]/span)).text
      puts mininode.xpath(%(table[#{j}]/span)).text
    end
  end
end

Answer 1

Meditate on this:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<div id="hoge">
  <h1><span>title 1</span></h1>

    <h2><span>subtitle 1-1</span></h2>
    <p></p>
    <table class="fuga"><span>data 1-1</span></table>
    <p></p>

    //(the same structure repeated n times)

    <h2><span>subtitle 1-(n+2)<span/></h2>
    <p></p>
    <table class="fuga"><span>data 1-(n+2)</span></table>
    <p></p>


  //(the same structure repeated m times)

  <h1><span>title m</span></h1>

    <h2><span>subtitle m-1</span></h2>
    <p></p>
    <table class="fuga"><span>data m-1</span></table>
    <p></p>

    //(the same structure repeated l times)

    <h2><span>subtitle m-(l+2)</span></h2>
    <p></p>
    <table class="fuga"><span>data m-(l+2)</span></table>
    <p></p>


</div>
EOT

Process doc:

div = doc.at('#hoge')
h1_blocks = div.children.slice_before{ |node| node.name == 'h1' }.map{ |nodes| Nokogiri::XML::NodeSet.new(doc, nodes) }

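Running that leaves h1_blocks containing an array of NodeSets. Index 0 holds only the whitespace text node that precedes the first <h1>, so the first real group, based on the HTML, is at index 1: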

h1_blocks[1].map(&:to_html)
# => ["<h1><span>title 1</span></h1>",
#     "\n\n    ",
#     "<h2><span>subtitle 1-1</span></h2>",
#     "\n    ",
#     "<p></p>",
#     "\n    ",
#     "<table class=\"fuga\"><span>data 1-1</span></table>",
#     "\n    ",
#     "<p></p>",
#     "\n\n    //(the same structure repeated n times)\n\n    ",
#     "<h2><span>subtitle 1-(n+2)<span></span></span></h2>",
#     "\n    ",
#     "<p></p>",
#     "\n    ",
#     "<table class=\"fuga\"><span>data 1-(n+2)</span></table>",
#     "\n    ",
#     "<p></p>",
#     "\n\n\n  //(the same structure repeated m times)\n\n  "]

Here is the second group, based on your HTML:

h1_blocks[2].map(&:to_html)
# => ["<h1><span>title m</span></h1>",
#     "\n\n    ",
#     "<h2><span>subtitle m-1</span></h2>",
#     "\n    ",
#     "<p></p>",
#     "\n    ",
#     "<table class=\"fuga\"><span>data m-1</span></table>",
#     "\n    ",
#     "<p></p>",
#     "\n\n    //(the same structure repeated l times)\n\n    ",
#     "<h2><span>subtitle m-(l+2)</span></h2>",
#     "\n    ",
#     "<p></p>",
#     "\n    ",
#     "<table class=\"fuga\"><span>data m-(l+2)</span></table>",
#     "\n    ",
#     "<p></p>",
#     "\n\n\n"]

How does this work?

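Ruby's Enumerable module has slice_before, which evaluates a test against each element and, on every truthy result, starts a new sub-array from that element. It's useful when we have a flat list of elements that has to be broken into separate chunks.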

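We often use it when parsing text that contains some kind of repeating block that has to be processed as a unit, such as paragraphs, network device interfaces, and so on.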
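As a rough illustration of that chunking on plain text, here is a minimal sketch using only core Ruby. The interface listing and the names lines and blocks are made up for the example; they are not from the question.

```ruby
# Hypothetical device listing: every "interface" line starts a new block.
lines = [
  'interface eth0',
  '  ip 10.0.0.1',
  '  mtu 1500',
  'interface eth1',
  '  ip 10.0.0.2'
]

# slice_before returns an Enumerator; to_a turns it into an array of arrays.
blocks = lines.slice_before { |line| line.start_with?('interface') }.to_a

blocks.each { |block| p block }
# ["interface eth0", "  ip 10.0.0.1", "  mtu 1500"]
# ["interface eth1", "  ip 10.0.0.2"]
```

Each chunk begins with the element that made the block return a truthy value, which is the same thing that happens with the <h1> nodes above.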

Once the children of the <div id="hoge"> tag have been chunked, each chunk is passed to map, which converts it back into a NodeSet so you can keep processing it in Nokogiri as usual.

How can I get the HTML between a pair of identical tags using Nokogiri?