Question

How can I extract the text from a <p> while keeping the <a> tags?

<p>
  Some <a href="http://somewhere.com">link</a> going somewhere.
  <ul>
    <li><a href="http://lowendbox.com/">Low end</a></li>
  </ul>
  Some trailing text.
</p>

Expected output:

Some <a href="http://somewhere.com">link</a> going somewhere.
<a href="http://lowendbox.com/">Low end</a>
Some trailing text.

The only solution I can think of is to override Nokogiri's text method and recurse through the children. I'm hoping there is a simpler way.

Answer 1

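A <ul> is not allowed inside a <p>, so any attempt to parse this as HTML4 or HTML5 will fail to give you the tree you expect: a conforming parser closes the <p> implicitly as soon as it reaches the <ul>. That leaves a regular expression, which solves this particular problem easily: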

str = <<EOF
<p>
  Some <a href="http://somewhere.com">link</a> going somewhere.
  <ul>
    <li><a href="http://lowendbox.com/">Low end</a></li>
  </ul>
  Some trailing text.
</p>
EOF
puts str.gsub(/<\/?(p|ul|li)>/,'')

#  Some <a href="http://somewhere.com">link</a> going somewhere.
#
#    <a href="http://lowendbox.com/">Low end</a>
#
#  Some trailing text.
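
If you want the result to match the expected output exactly (no leftover indentation or blank lines) and you don't want to list every tag by name, a variation along these lines might work. This is only a sketch: `strip_tags_except_links` is a hypothetical helper name, not part of Nokogiri or Ruby, and it assumes simple markup with no `>` inside attribute values and no comments or script blocks.

```ruby
# Hypothetical helper (not a Nokogiri or Ruby built-in): remove every tag
# except <a ...> and </a>, then trim each line and drop the empty ones.
# Assumes simple markup: no ">" inside attribute values, no comments/scripts.
def strip_tags_except_links(html)
  html
    .gsub(%r{</?(?!a\b)[a-zA-Z][^>]*>}, '') # any tag whose name is not exactly "a"
    .lines
    .map(&:strip)
    .reject(&:empty?)
    .join("\n")
end

sample = <<~HTML
  <p>
    Some <a href="http://somewhere.com">link</a> going somewhere.
    <ul>
      <li><a href="http://lowendbox.com/">Low end</a></li>
    </ul>
    Some trailing text.
  </p>
HTML

puts strip_tags_except_links(sample)
# Some <a href="http://somewhere.com">link</a> going somewhere.
# <a href="http://lowendbox.com/">Low end</a>
# Some trailing text.
```

The negative lookahead `(?!a\b)` is what spares the links: it rejects a tag name that is exactly `a`, while names that merely start with "a" (such as `abbr`) are still removed.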
