Question

I have the following XML:

<w:body>
  <w:p w14:paraId="15812FB6" w14:textId="27A946A1" w:rsidR="001665B3" w:rsidRDefault="00771852">
    <w:r>
      <w:t xml:space="preserve">I am writing this </w:t>
    </w:r>
    <w:ins w:author="Mitchell Gould" w:date="2016-10-04T17:24:00Z" w:id="0">
      <w:r w:rsidR="00A1573E">
        <w:t>text to look</w:t>
      </w:r>
    </w:ins>
    <w:del w:author="Mitchell Gould" w:date="2016-10-04T17:24:00Z" w:id="1">
      <w:r w:rsidDel="00A1573E">
        <w:delText>to test</w:delText>
      </w:r>
    </w:del>
...

I know I can get all of the text with:

only_text_array = @file.search('//text()')

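However, what I actually want is two sets of text:

One containing all of the text except the text inside <w:del>...</w:del> elements.
Another containing all of the text except the text inside <w:ins>...</w:ins> elements.

How can I do this?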

Answer 1

You could try the following XPath:

//text()[not(ancestor::w:del or ancestor::w:ins)]


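This XPath returns every text node that has neither a w:del nor a w:ins ancestor. To get the two sets separately, keep only one of the two ancestor tests in each query.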

Answer 2

I'd do something like this:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
  <body>
    <p class="ignore">foobar</p>
    <p>Keep this</p>
    <p class="ignore2">foobar2</p>
  </body>
</html>
EOT

text1, text2 = %w[.ignore .ignore2].map do |s|
  tmp_doc = doc.dup
  tmp_doc.search(s).remove
  tmp_doc.text.strip
end

text1 # => "Keep this\n    foobar2"
text2 # => "foobar\n    Keep this"

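Iterate over a list of selectors for the things you don't want, dup the document, remove the unwanted nodes, and return the document's text after a little cleanup.

By default, dup does a deep copy, so removing nodes from the copy doesn't affect doc.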

How do I get all of the text except the text inside certain tags using Nokogiri?
