I have this HTML fragment:
<p>Yes. No. Both. Maybe a <a href="/plane">plane</a>?</p><h2 id="2-is-it-a-plane">2. Is it a plane?</h2><p>Yes. No. Both.</p><h2 id="3-what-is-superman-anyway">3. What is Superman, anyway?</h2><p>Is it a bird? Is it a plane? No, it’s Superman.</p>
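I need to replace `plane` with `<a href="/some/url">plane</a>`, but only where it appears outside `<a></a>` anchor tags and outside `<h1>`-`<h6>` heading tags. This is what I have tried with Nokogiri: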
require 'nokogiri'
h = '<p>Yes. No. Both. Maybe a <a href="/plane">plane</a>?</p><h2 id="2-is-it-a-plane">2. Is it a plane?</h2><p>Yes. No. Both.</p><h2 id="3-what-is-superman-anyway">3. What is Superman, anyway?</h2><p>Is it a bird? Is it a plane? No, it’s Superman.</p>'
doc = Nokogiri::HTML::DocumentFragment.parse(h)
# Try 1: This outputs all content, but I need to avoid <a>/<h#>
doc.content
# Try 2: The below line removes headings permanently - I need them to remain
# doc.search(".//h2").remove
# Try 3: This just comes out empty - why?
# doc.xpath('text()')
# doc.xpath('//text()')
# then,
# code to replace `plane` is here ...
# this part is not needed
# then,
doc.to_html
I have tried various other `xpath` variants, all to no avail. What am I doing wrong, and how can I achieve my goal?
Answer 0 (score: 0)
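After playing around a bit, it looks like you need the xpath `p/text()`. Things then get more complicated, because you are trying to replace plain text with a link element. When I simply tried `gsub`-ing, Nokogiri escaped the new link, so I had to split the text node into several sibling nodes, some of which can then be link elements rather than text nodes. You may need to tweak this (I am by no means an XML or Nokogiri expert), but at least for me it seems to work on the example provided, so it should get you going: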
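doc.xpath('p/text()').grep(/plane/) do |node|
  # Split on a capturing group so that 'plane' itself is kept in the result.
  node_content, *remaining_texts = node.content.split(/(plane)/)
  # The original text node keeps only the text before the first match.
  node.content = node_content
  remaining_texts.each do |text|
    if text == 'plane'
      # A string argument is parsed as markup, so this inserts a real <a> element.
      node = node.add_next_sibling('<a href="/some/url">plane</a>').last
    else
      node = node.add_next_sibling(text).last
    end
  end
end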
puts doc
# <p>Yes. No. Both. Maybe a <a href="/plane">plane</a>?</p>
# <h2 id="2-is-it-a-plane">2. Is it a plane?</h2>
# <p>Yes. No. Both.</p>
# <h2 id="3-what-is-superman-anyway">3. What is Superman, anyway?</h2>
# <p>Is it a bird? Is it a <a href="/some/url">plane</a>? No, it’s Superman.</p>
A more general xpath (covering every element other than headings and links) might be:
*[not(name()='a')][not(name()='h1')][not(name()='h2')][not(name()='h3')][not(name()='h4')][not(name()='h5')][not(name()='h6')]/text()