I want to strip tags from some HTML without removing the content those tags wrap. For example, I have a file test.html:
<p class="P1"><span class="T2">Some text, goes to uppercase</span>
<p class="P4"><span class="T4"> </span><span class="T3">other text</span>
<span class="T5">italics</span><span class="T3">‘more text with UTF-8 ’</span>
</p></p>
I want to get the following output:
SOME TEXT, GOES TO UPPERCASE
other text
<em>italics</em> ‘more text with UTF-8 ’
My code is:
f = File.open('raw/test.html',"r")
doc = Nokogiri::XML::DocumentFragment.parse(f.read.encode('UTF-8'))
f.close
doc.css("span.T2").each do |span|
span.replace span.content.upcase
end
doc.css("span.T5").each do |span|
span.replace "<em>"+span.content+"</em>"
end
doc.css("span").each do |span|
span.replace span.content
end
doc.css("p").each do |p|
p.replace Nokogiri::XML::Text.new(p.inner_html, p.document)
end
f = File.open('processed/test.html',"w")
f.write(doc)
f.close
The output I get is:
SOME TEXT, GOES TO UPPERCASE
<p class="P4">
other text
<em>italics </em>&#x2018;more text with UTF-8 &#x2019;
&#x2018;our common mother&#x2019;
</p>
Thanks in advance.
The solution is as follows:
require 'nokogiri'
require 'htmlentities'
coder = HTMLEntities.new
f = File.open('raw/test.html',"r")
doc = Nokogiri::XML::DocumentFragment.parse(f.read.encode('UTF-8'))
f.close
doc.css("p").each do |p|
p.replace p.inner_html
end
doc.css("span.T2").each do |span|
span.replace span.content.upcase
end
doc.css("span.T5").each do |span|
span.replace "<em>"+span.content+"</em>"
end
doc.css("span").each do |span|
span.replace span.inner_html
end
f = File.open('processed/test.html',"w")
f.write(coder.decode(doc))
f.close
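The coder.decode call is there because the serialized output contains numeric character references such as &#x2018; instead of the original characters. As a rough illustration of what decoding those references involves, here is a minimal standard-library sketch. decode_numeric_refs is a hypothetical helper, not part of HTMLEntities; it handles only numeric references (not named ones like &amp;) and does not guard against invalid code points.

```ruby
# Hypothetical helper: decodes numeric character references only,
# e.g. "&#x2018;" (hexadecimal) or "&#8216;" (decimal).
def decode_numeric_refs(text)
  text.gsub(/&#(x\h+|\d+);/i) do
    ref = Regexp.last_match(1)
    # A leading x/X marks a hexadecimal reference; otherwise it is decimal.
    codepoint = ref.start_with?('x', 'X') ? ref[1..-1].hex : ref.to_i
    # Pack the code point into a UTF-8 encoded string.
    [codepoint].pack('U')
  end
end

puts decode_numeric_refs('&#x2018;more text with UTF-8 &#x2019;')
```

In the solution above, HTMLEntities does this job and also covers named entities, so a helper like this would only make sense if adding the gem is not an option.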
Answer 0 (score: 1)
Using span.replace "<em>"+span.content+"</em>" is not correct. You need to tell Nokogiri to replace with HTML, not text. For example:
span.inner_html = "<em>"+span.content+"</em>"
Result:
<span class="T5"><em>italics</em></span>