Question

Input

I have this HTML fragment:

<p>Yes.  No.  Both. Maybe a <a href="/plane">plane</a>?</p><h2 id="2-is-it-a-plane">2. Is it a plane?</h2><p>Yes.  No.  Both.</p><h2 id="3-what-is-superman-anyway">3. What is Superman, anyway?</h2><p>Is it a bird?  Is it a plane?  No, it’s Superman.</p>

Desired output

I need to

replace the word plane with <a href="/some/url">plane</a>,
but only where it is outside <a></a> anchor tags
and outside heading tags (<h1> to <h6>).

Code

Here is what I tried with Nokogiri:

require 'nokogiri'
h = '<p>Yes.  No.  Both. Maybe a <a href="/plane">plane</a>?</p><h2 id="2-is-it-a-plane">2. Is it a plane?</h2><p>Yes.  No.  Both.</p><h2 id="3-what-is-superman-anyway">3. What is Superman, anyway?</h2><p>Is it a bird?  Is it a plane?  No, it’s Superman.</p>'
doc = Nokogiri::HTML::DocumentFragment.parse(h)

# Try 1: This outputs all content, but I need to avoid <a>/<h#>
doc.content 

# Try 2: The below line removes headings permanently - I need them to remain 
# doc.search(".//h2").remove

# Try 3: This just comes out empty - why?
# doc.xpath('text()')    
# doc.xpath('//text()')

# then,
# code to replace `plane` is here ...
# this part is not needed
# then,
doc.to_html

I have tried various other XPath variants, all to no avail. What am I doing wrong, and how can I achieve this?

Answer 1

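After playing around a bit, it looks like you need the XPath p/text(). Things then get more complicated, because you are trying to replace plain text with a link element. When I simply tried gsub-ing, Nokogiri escaped the new link, so I had to split the text node into several siblings, some of which can then be link elements rather than text nodes. You may need to make some adjustments (I'm by no means an XML or Nokogiri expert), but at least for me it seems to work on the example provided, so it should get you going: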

doc.xpath('p/text()').grep(/plane/) do |node|
  node_content, *remaining_texts = node.content.split(/(plane)/)

  node.content = node_content
  remaining_texts.each do |text|
    if text == 'plane' 
      node = node.add_next_sibling('<a href="/some/url">plane</a>').last
    else
      node = node.add_next_sibling(text).last
    end
  end
end

puts doc
# <p>Yes.  No.  Both. Maybe a <a href="/plane">plane</a>?</p>
# <h2 id="2-is-it-a-plane">2. Is it a plane?</h2>
# <p>Yes.  No.  Both.</p>
# <h2 id="3-what-is-superman-anyway">3. What is Superman, anyway?</h2>
# <p>Is it a bird?  Is it a <a href="/some/url">plane</a>?  No, it’s Superman.</p>
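
The loop above depends on String#split keeping the separator in its result when the pattern contains a capture group. A minimal sketch of that behaviour in plain Ruby, without Nokogiri (the sample strings are only for illustration):

```ruby
# With a capture group in the pattern, split returns the matched
# separator as its own element instead of discarding it.
text = 'Is it a bird?  Is it a plane?  No.'
pieces = text.split(/(plane)/)
# pieces is ["Is it a bird?  Is it a ", "plane", "?  No."]

# Text that starts with the match produces a leading empty string,
# and back-to-back matches produce an empty string between them.
edge = 'planeplane!'.split(/(plane)/)
# edge is ["", "plane", "", "plane", "!"]
```

Those empty strings are worth watching for: passing one to add_next_sibling would most likely return an empty node set, so .last would be nil and the next iteration would fail. Skipping empty pieces inside the loop seems the safer choice.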

A more general XPath (for all elements other than headings and links) might be:

*[not(name()='a')][not(name()='h1')][not(name()='h2')][not(name()='h3')][not(name()='h4')][not(name()='h5')][not(name()='h6')]/text()
