Question

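I am trying to clean up some CMS-entered HTML that contains extraneous paragraph tags and br tags. The Sanitize gem has proved very useful for this, but I am stuck on one particular issue.

The problem occurs when a br tag sits directly after an opening paragraph tag or directly before a closing one, e.g.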

<p>
  <br />
  Some text here
  <br />
  Some more text
  <br />
</p>

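I would like to strip out the extraneous first and last br tags, but not the middle one.

I was hoping I could do this with a Sanitize transformer, but I can't seem to find the right matcher to achieve it.

Any help would be much appreciated.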

Answer 1

Here is how to locate the <br> nodes that are direct children of a <p>:

require 'nokogiri'

doc = Nokogiri::HTML::DocumentFragment.parse(<<EOT)
<p>
  <br />
  Some text here
  <br />
  Some more text
  <br />
</p>
EOT

doc.search('p > br').map(&:to_html)
# => ["<br>", "<br>", "<br>"]

Once we know we can find them, it's easy to remove specific ones:

br_nodes = doc.search('p > br')
br_nodes.first.remove
br_nodes.last.remove
doc.to_html
# => "<p>\n  \n  Some text here\n  <br>\n  Some more text\n  \n</p>\n"

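Note that Nokogiri removed the <br> nodes, but the Text nodes that were their immediate siblings, which hold the "\n" and the indentation, were left behind. A browser collapses that whitespace and never displays the line ends, but if the leftovers bother you, here is how to remove them as well: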

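# Start again from a freshly parsed doc.
br_nodes = doc.search('p > br')
[br_nodes.first, br_nodes.last].each do |br|
  # Remove only whitespace-only neighbours. Removing br.next_sibling
  # unconditionally would also delete "Some text here".
  [br.previous_sibling, br.next_sibling].each do |sibling|
    sibling.remove if sibling && sibling.text? && sibling.blank?
  end
  br.remove
end
doc.to_html
# => "<p>\n  Some text here\n  <br>\n  Some more text\n  </p>\n"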

Answer 2

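# Sanitize transformer: strips <br> tags from the very start of each <p>.
initial_linebreak_transformer = lambda {|options|
  node = options[:node]
  # Plain nil check, so this also works without ActiveSupport's present?.
  if node && node.element? && node.name.downcase == 'p'
    first_child = node.children.first
    # first_child is nil for an empty paragraph, so guard before using it.
    if first_child && first_child.name.downcase == 'br'
      first_child.unlink
      # Run again in case several <br> tags are stacked at the start.
      initial_linebreak_transformer.call options
    end
  end
}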
