I want to scrape an HTML file like this:
<div id="hoge">
<h1><span>title 1</span></h1>
<h2><span>subtitle 1-1</span></h2>
<p></p>
<table class="fuga"><span>data 1-1</span></table>
<p></p>
//(the same structure repeated n times)
<h2><span>subtitle 1-(n+2)<span/></h2>
<p></p>
<table class="fuga"><span>data 1-(n+2)</span></table>
<p></p>
//(the same structure repeated m times)
<h1><span>title m</span></h1>
<h2><span>subtitle m-1</span></h2>
<p></p>
<table class="fuga"><span>data m-1</span></table>
<p></p>
//(the same structure repeated l times)
<h2><span>subtitle m-(l+2)</span></h2>
<p></p>
<table class="fuga"><span>data m-(l+2)</span></table>
<p></p>
</div>
For each title ("title x"), I need the table value (data x-y) that belongs to each subtitle ("subtitle x-y").
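To associate them, I want to cut the HTML into pieces that run from one <h1> to the last <p> before the next <h1>, but I can't figure out how to do it.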
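I spent 5 hours searching, reading, and going through trial and error, and ended up with the code below, but it still doesn't work.
What's wrong? How can I cut up the HTML?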
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(open("http://example.com/"))
doc.xpath('//div[@id="mw-content-text"]').each do |node|
  for i in 1..node.xpath('h1').length do
    mininode = node.xpath(%(node()[not(following-sibling::h1[#{i}] or preceding-sibling::h1[#{i+1}])]))
    title = mininode.xpath('h1/span').text
    puts title unless title.empty?
    puts "============"
    for j in 1..mininode.xpath('h2').length do
      puts mininode.xpath(%(h2[#{j}]/span)).text
      puts mininode.xpath(%(table[#{j}]/span)).text
    end
  end
end
Answer 0 (score: 1)
Meditate on this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div id="hoge">
<h1><span>title 1</span></h1>
<h2><span>subtitle 1-1</span></h2>
<p></p>
<table class="fuga"><span>data 1-1</span></table>
<p></p>
//(the same structure repeated n times)
<h2><span>subtitle 1-(n+2)<span/></h2>
<p></p>
<table class="fuga"><span>data 1-(n+2)</span></table>
<p></p>
//(the same structure repeated m times)
<h1><span>title m</span></h1>
<h2><span>subtitle m-1</span></h2>
<p></p>
<table class="fuga"><span>data m-1</span></table>
<p></p>
//(the same structure repeated l times)
<h2><span>subtitle m-(l+2)</span></h2>
<p></p>
<table class="fuga"><span>data m-(l+2)</span></table>
<p></p>
</div>
EOT
Processing doc:
div = doc.at('#hoge')
h1_blocks = div.children.slice_before{ |node| node.name == 'h1' }.map{ |nodes| Nokogiri::XML::NodeSet.new(doc, nodes) }
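Running that leaves h1_blocks holding an array of NodeSets, one per <h1> section. h1_blocks[0] holds whatever precedes the first <h1> (here just a whitespace text node), so the real groups start at index 1. This is the first group, based on the HTML: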
h1_blocks[1].map(&:to_html)
# => ["<h1><span>title 1</span></h1>",
# "\n\n ",
# "<h2><span>subtitle 1-1</span></h2>",
# "\n ",
# "<p></p>",
# "\n ",
# "<table class=\"fuga\"><span>data 1-1</span></table>",
# "\n ",
# "<p></p>",
# "\n\n //(the same structure repeated n times)\n\n ",
# "<h2><span>subtitle 1-(n+2)<span></span></span></h2>",
# "\n ",
# "<p></p>",
# "\n ",
# "<table class=\"fuga\"><span>data 1-(n+2)</span></table>",
# "\n ",
# "<p></p>",
# "\n\n\n //(the same structure repeated m times)\n\n "]
And here is the second group, based on your HTML:
h1_blocks[2].map(&:to_html)
# => ["<h1><span>title m</span></h1>",
# "\n\n ",
# "<h2><span>subtitle m-1</span></h2>",
# "\n ",
# "<p></p>",
# "\n ",
# "<table class=\"fuga\"><span>data m-1</span></table>",
# "\n ",
# "<p></p>",
# "\n\n //(the same structure repeated l times)\n\n ",
# "<h2><span>subtitle m-(l+2)</span></h2>",
# "\n ",
# "<p></p>",
# "\n ",
# "<table class=\"fuga\"><span>data m-(l+2)</span></table>",
# "\n ",
# "<p></p>",
# "\n\n\n"]
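How does this work?
Ruby's Enumerable module has slice_before, which evaluates a block for each element and starts a new sub-array at every element for which the block returns a truthy value. This is useful when we have a flat list of elements that has to be broken into distinct chunks.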
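We often use it when parsing text that contains some kind of repeating block that has to be handled as a unit, such as paragraphs, network device interfaces, etc.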
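As a rough illustration of that text-parsing use, here is a minimal sketch in plain Ruby. The lines array is made-up sample data, not real device output, and the rule "a new block starts at every unindented line" is an assumption chosen for the example:

```ruby
# Hypothetical sample data: each interface starts at an unindented line,
# and its details follow on indented lines.
lines = [
  'eth0: flags=UP',
  '  inet 192.0.2.10',
  '  mtu 1500',
  'lo: flags=UP',
  '  inet 127.0.0.1'
]

# Start a new chunk at every line that does not begin with whitespace.
blocks = lines.slice_before { |line| line !~ /\A\s/ }.to_a

# Each chunk is an ordinary Array: the header line followed by its details.
blocks.each do |header, *details|
  puts "#{header} -> #{details.map(&:strip).join(', ')}"
end
```

slice_before returns an Enumerator, so to_a (or map, each, and so on) is needed to get at the chunks themselves.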
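Once the children of the <div id="hoge"> tag have been chunked, the chunks are passed to map, which turns each one back into a NodeSet so that you can keep working with them in Nokogiri as usual.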