I'm looking for a way to produce the HTML result shown further below, using the following Ruby code and Nokogiri:
require 'rubygems'
require 'nokogiri'
value = Nokogiri::HTML.parse(<<-HTML_END)
<html>
<body>
<p id='1'>A</p>
<p id='2'>B</p>
<h1>Bla</h1>
<p id='3'>C</p>
<p id='4'>D</p>
<p id='5'>E</p>
</body>
</html>
HTML_END
# The selected-array is given by the application.
# It consists of a sorted array with all ids of
# <p> that need to be enclosed by the <div>
selected = ["2","3","4"]
first_p = selected.first
last_p = selected.last
#
# WHAT RUBY CODE DO I NEED TO INSERT HERE TO GET
# THE RESULTING HTML AS SEEN BELOW?
#
The resulting HTML should look like this (note the inserted <div id='XYZ'>):
<html>
<body>
<p id='1'>A</p>
<div id='XYZ'>
<p id='2'>B</p>
<h1>Bla</h1>
<p id='3'>C</p>
<p id='4'>D</p>
</div>
<p id='5'>E</p>
</body>
</html>
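Answer 0 (score: 4)
In cases like this, you will usually want to use the SAX interface that the underlying library provides, so that you can traverse and rewrite the input XML (or XHTML) in a stateful, sequential way: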
require 'nokogiri'
require 'cgi'

Nokogiri::XML::SAX::Parser.new(
  Class.new(Nokogiri::XML::SAX::Document) {
    def initialize first_p, last_p
      @first_p, @last_p = first_p, last_p
      @depth = nil  # nil until the last selected <p> has been opened
    end

    def start_document
      puts '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">'
    end

    def start_element name, attrs = []
      # Recent Nokogiri versions pass attributes as an array of [name, value] pairs.
      attrs = attrs.to_h
      @depth += 1 unless @depth.nil?
      # Open the wrapper just before the first selected <p>.
      print '<div>' if name == 'p' && attrs['id'] == @first_p
      # Start tracking nesting depth once the last selected <p> opens.
      @depth = 1 if name == 'p' && attrs['id'] == @last_p && @depth.nil?
      print "<#{ [ name, attrs.collect { |k,v| "#{k}=\"#{CGI::escapeHTML(v)}\"" } ].flatten.join(' ') }>"
    end

    def end_element name
      @depth -= 1 unless @depth.nil?
      print "</#{name}>"
      # Depth 0 means the last selected <p> has just been closed.
      if @depth == 0
        print '</div>'
        @depth = nil
      end
    end

    def cdata_block string
      print "<![CDATA[#{CGI::escapeHTML(string)}]]>"
    end

    def characters string
      print CGI::escapeHTML(string)
    end

    def comment string
      print "<!--#{string}-->"
    end
  }.new('2', '4')
).parse(<<-HTML_END)
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<!-- comment -->
<![CDATA[
cdata goes here
]]>
"special" entities
<p id="1">A</p>
<p id="2">B</p>
<p id="3">C</p>
<p id="4">D</p>
<p id="5">E</p>
<emptytag/>
</body>
</html>
HTML_END
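The depth counter is the part of the handler above that is easiest to get wrong, so here is a rough sketch of it in isolation. It replays a hand-written list of events instead of parsing a document, which makes it runnable without Nokogiri. The RangeWrapper class and the event tuples are names made up for this illustration, and attribute escaping is left out for brevity.

```ruby
# Stand-in for the SAX handler: same depth bookkeeping, but output is
# collected in a string and no escaping is performed.
# RangeWrapper is a hypothetical name, not part of Nokogiri.
class RangeWrapper
  attr_reader :out

  def initialize(first_id, last_id)
    @first_id = first_id
    @last_id = last_id
    @depth = nil # nil until the last selected <p> has been opened
    @out = +''
  end

  def start_element(name, attrs = {})
    @depth += 1 if @depth
    @out << '<div>' if name == 'p' && attrs['id'] == @first_id
    @depth = 1 if name == 'p' && attrs['id'] == @last_id && @depth.nil?
    attr_text = attrs.map { |k, v| %( #{k}="#{v}") }.join
    @out << "<#{name}#{attr_text}>"
  end

  def end_element(name)
    @depth -= 1 if @depth
    @out << "</#{name}>"
    return unless @depth == 0

    # The last selected <p> (including anything nested in it) is closed.
    @out << '</div>'
    @depth = nil
  end

  def characters(text)
    @out << text
  end
end

# Hand-written events; the last selected <p> contains a nested <b>.
events = [
  [:start_element, 'p', { 'id' => '1' }], [:characters, 'A'], [:end_element, 'p'],
  [:start_element, 'p', { 'id' => '2' }], [:characters, 'B'], [:end_element, 'p'],
  [:start_element, 'p', { 'id' => '3' }], [:start_element, 'b'],
  [:characters, 'C'], [:end_element, 'b'], [:end_element, 'p'],
  [:start_element, 'p', { 'id' => '4' }], [:characters, 'D'], [:end_element, 'p']
]

wrapper = RangeWrapper.new('2', '3')
events.each { |callback, *args| wrapper.public_send(callback, *args) }
puts wrapper.out
```

Because the counter only reaches zero when the end tag of the last selected <p> arrives, the closing </div> is emitted after the nested <b> has been closed, not before.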
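Alternatively, you can use the DOM model interface (rather than the SAX interface) to load the whole document into memory (the same way you started out in the original question) and then manipulate the nodes (insertion and removal) as follows: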
require 'rubygems'
require 'nokogiri'
doc = Nokogiri::HTML.parse(<<-HTML_END)
<html>
<body>
<p id='1'>A</p>
<p id='2'>B</p>
<p id='3'>C</p>
<p id='4'>D</p>
<p id='5'>E</p>
</body>
</html>
HTML_END
first_p = "2"
last_p = "4"
doc.css("p[id=\"#{first_p}\"] ~ p[id=\"#{last_p}\"]").each { |node|
  div_node = nil
  node.parent.children.each { |sibling_node|
    if sibling_node.name == 'p' && sibling_node['id'] == first_p
      div_node = Nokogiri::XML::Node.new('div', doc)
      sibling_node.add_previous_sibling(div_node)
    end
    unless div_node.nil?
      sibling_node.remove
      div_node << sibling_node
    end
    if sibling_node.name == 'p' && sibling_node['id'] == last_p
      div_node = nil
    end
  }
}
puts doc
Answer 1 (score: 1)
Here is the working solution I implemented in my project (Vlad @ SO & Whitelist @ irc#rubyonrails: thanks for your help and inspiration.):
require 'rubygems'
require 'nokogiri'
value = Nokogiri::HTML.parse(<<-HTML_END)
<html>
<body>
<p id='1'>A</p>
<p id='2'>B</p>
<h1>Bla</h1>
<p id='3'>C</p>
<p id='4'>D</p>
<p id='5'>E</p>
</body>
</html>
HTML_END
# The selected-array is given by the application.
# It consists of a sorted array with all ids of
# <p> that need to be enclosed by the <div>
selected = ["2","3","4"]
# We want elements, not node sets!
# .first returns a Nokogiri::XML::Element instead of a Nokogiri::XML::NodeSet
first_p = value.css("p##{selected.first}").first
last_p = value.css("p##{selected.last}").first
parent = value.css('body').first
# build and set new div_node
div_node = Nokogiri::XML::Node.new('div', value)
div_node['class'] = 'XYZ'
# add div_node before first_p
first_p.add_previous_sibling(div_node)
selected_node = false
parent.children.each do |tag|
  # if it's the first_p
  selected_node = true if selected.include? tag['id']
  # if it's anything between the first_p and the last_p
  div_node.add_child(tag) if selected_node
  # if it's the last_p
  selected_node = false if selected.last == tag['id']
end
puts value.to_html
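The loop above relies on a boolean flag that is switched on at a selected <p> and switched off after the last one, which is why the unselected <h1> also ends up inside the <div>. A minimal sketch of that toggle, using plain [name, id] pairs as a stand-in for parent.children (the partition_by_range helper and the pair representation are assumptions made for illustration, not Nokogiri API):

```ruby
# Split children into those that would be moved into the <div> and the rest.
# Mirrors the flag logic of the loop above on plain [name, id] pairs.
def partition_by_range(children, selected)
  inside = false
  wrapped = []
  rest = []
  children.each do |name, id|
    inside = true if selected.include?(id)  # a selected <p> turns the flag on
    (inside ? wrapped : rest) << [name, id] # everything seen while on is wrapped
    inside = false if id == selected.last   # the last selected <p> turns it off
  end
  [wrapped, rest]
end

# Nodes without an id attribute (such as the <h1>) carry nil here.
children = [['p', '1'], ['p', '2'], ['h1', nil], ['p', '3'], ['p', '4'], ['p', '5']]
wrapped, rest = partition_by_range(children, %w[2 3 4])
p wrapped
p rest
```

Note that the flag never looks at whether a node between the first and last selected <p> is itself selected, so the approach only fits cases where the whole contiguous range should be wrapped.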