Nokogiri: search and update only some text, leaving text inside certain elements untouched

Date: 2018-07-07 09:18:17

Tags: ruby nokogiri

Input

I have this HTML snippet:

<p>Yes.  No.  Both. Maybe a <a href="/plane">plane</a>?</p><h2 id="2-is-it-a-plane">2. Is it a plane?</h2><p>Yes.  No.  Both.</p><h2 id="3-what-is-superman-anyway">3. What is Superman, anyway?</h2><p>Is it a bird?  Is it a plane?  No, it’s Superman.</p>

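Desired output

I need to

  1. replace the word plane with <a href="/some/url">plane</a>,
  2. but only where it sits outside <a></a> anchor tags,
  3. and outside heading tags (<h1> through <h6>).

Code

Here is what I tried with Nokogiri: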

require 'nokogiri'
h = '<p>Yes.  No.  Both. Maybe a <a href="/plane">plane</a>?</p><h2 id="2-is-it-a-plane">2. Is it a plane?</h2><p>Yes.  No.  Both.</p><h2 id="3-what-is-superman-anyway">3. What is Superman, anyway?</h2><p>Is it a bird?  Is it a plane?  No, it’s Superman.</p>'
doc = Nokogiri::HTML::DocumentFragment.parse(h)

# Try 1: This outputs all content, but I need to avoid <a>/<h#>
doc.content 

# Try 2: The below line removes headings permanently - I need them to remain 
# doc.search(".//h2").remove

# Try 3: This just comes out empty - why?
# doc.xpath('text()')    
# doc.xpath('//text()')

# then,
# code to replace `plane` is here ...
# this part is not needed
# then,
doc.to_html

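I have tried various other XPath variations, all to no avail. What am I doing wrong, and how can I achieve this?

1 Answer:

Answer 0 (score: 0)

After playing around a bit, it looks like you need the XPath p/text(). Things then get more complicated, because you are trying to replace plain text with a link element. When I simply tried gsub-ing the text, Nokogiri escaped the new link, so I had to split the text node into several sibling nodes, which lets a link element stand in for some of those siblings instead of a text node. You may need to make some adjustments (I am by no means an XML or Nokogiri expert), but at least for me it seems to work for the sample provided, so it should get you moving: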

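# Visit only the text nodes directly under <p> whose text mentions "plane"
doc.xpath('p/text()').grep(/plane/) do |node|
  # The capture group keeps each "plane" as its own piece of the split
  node_content, *remaining_texts = node.content.split(/(plane)/)

  # Shrink the original text node to the text before the first match
  node.content = node_content
  remaining_texts.each do |text|
    if text == 'plane' 
      # The markup string is parsed, so this inserts a real <a> element
      node = node.add_next_sibling('<a href="/some/url">plane</a>').last
    else
      node = node.add_next_sibling(text).last
    end
  end
end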

puts doc
# <p>Yes.  No.  Both. Maybe a <a href="/plane">plane</a>?</p>
# <h2 id="2-is-it-a-plane">2. Is it a plane?</h2>
# <p>Yes.  No.  Both.</p>
# <h2 id="3-what-is-superman-anyway">3. What is Superman, anyway?</h2>
# <p>Is it a bird?  Is it a <a href="/some/url">plane</a>?  No, it’s Superman.</p>

A more generic XPath (for all elements other than headings and links) might be:

*[not(name()='a')][not(name()='h1')][not(name()='h2')][not(name()='h3')][not(name()='h4')][not(name()='h5')][not(name()='h6')]/text()