Using Nokogiri, how do I strip all tags under a node except for certain elements? For example, with this setup:
src = <<EOS
<html>
<body>
<p>
Hello <i>world</i>!
This is <em>another</em> line.
<p><h3>And a paragraph <em>with</em> a heading.</h3></p>
<b>Third line.</b>
</p>
</body>
</html>
EOS
doc = Nokogiri::HTML(src)
para = doc.at('//p')
How do I remove all elements inside the paragraph (while keeping their content) except for the <i> and <b> elements? The result would be:
<html>
<body>
<p>
Hello <i>world</i>!
This is another line.
And a paragraph with a heading.
<b>Third line.</b>
</p>
</body>
</html>
Answer 0 (score: 4)
To round out the examples, here is one that uses Nokogiri without XSLT:
require 'nokogiri'
src = <<EOS
<html>
<body>
<p>
Hello <i>world</i>!
This is <em>another</em> line.
<p><h3>And a paragraph <em>with</em> a heading.</h3></p>
<b>Third line.</b>
</p>
</body>
</html>
EOS
doc = Nokogiri::HTML(src)
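# Report any problems the HTML parser found in the markup.
if doc.errors.any?
  puts "doc.errors:"
  doc.errors.each do |e|
    puts "#{e.line}: #{e}"
  end
  puts
end

# Replace each direct child of a <p> with its text, unless it is <i> or <b>.
doc.search('//p/*').each do |n|
  n.replace(n.content) unless %w[i b].include?(n.name)
end

puts doc.to_html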
# >> doc.errors:
# >> 6: Unexpected end tag : p
# >> 8: Unexpected end tag : p
# >>
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body>
# >> <p>
# >> Hello <i>world</i>!
# >> This is another line.
# >> </p>
# >> <p></p>
# >> <h3>And a paragraph <em>with</em> a heading.</h3>
# >> <b>Third line.</b>
# >>
# >> </body></html>
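Note that Nokogiri was not happy with the markup and made some fixups: the HTML parser closed the first <p> early, so the <h3> ended up outside it and was not stripped. Also, the actual tag-stripping code is only three lines and could be written on one.
Answer 1 (score: 3)
Flack gave the correct answer using an XSLT template; here is a complete Nokogiri example: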
xslt = <<EOS
<stylesheet version="1.0" xmlns="http://www.w3.org/1999/XSL/Transform">
<output method="html" indent="yes"/>
<template match="node() | @*">
<copy>
<apply-templates select="node() | @*"/>
</copy>
</template>
<template match="p//*[not(self::i or self::b)]">
<apply-templates/>
</template>
</stylesheet>
EOS
src = <<EOS
<html>
<body>
<p>
Hello <i>world</i>!
This is <em>another</em> line.
<p><h3>And a paragraph <em>with</em> a heading.</h3></p>
<b>Third line.</b>
</p>
</body>
</html>
EOS
doc = Nokogiri::XML(src)
paragraph = doc.at('p')
xslt = Nokogiri::XSLT(xslt)
transformed_paragraph = xslt.transform(paragraph)
paragraph.replace transformed_paragraph.children
puts doc
Output:
<?xml version="1.0"?>
<html>
<body>
<p>
Hello <i>world</i>!
This is another line.
And a paragraph with a heading.
<b>Third line.</b>
</p>
</body>
</html>
Answer 2 (score: 0)
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="html" indent="yes"/>
<xsl:template match="node() | @*">
<xsl:copy>
<xsl:apply-templates select="node() | @*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="em | p/p | h3">
<xsl:apply-templates/>
</xsl:template>
</xsl:stylesheet>
Applied to your sample, the result would be:
<html>
<body>
<p>
Hello
<i>world</i>!
This is another line.
And a paragraph with a heading.
<b>Third line.</b>
</p>
</body>
</html>
Edit, as requested in the comments:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="html" indent="yes"/>
<xsl:template match="node() | @*">
<xsl:copy>
<xsl:apply-templates select="node() | @*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="p//*[not(self::i or self::b)]">
<xsl:apply-templates/>
</xsl:template>
</xsl:stylesheet>
This strips every element inside p (the tags, not their string values) except for the i and b elements.