Question

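I'm very new to XPath. I'm trying to extract some information from a website of laws and regulations, and for now I just want to:

1. Find a tag that contains the string "Article 1".
2. Starting from the tag found in (1), get that tag and everything after it, up to the tag that contains the other string "PRIME MINISTER" inside a <b> tag.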

<p>
  <b> <span> Article 1. </span> </b> 
  <span> 
     To approve the master plan on development 
     of tourism in Northern Central Vietnam 
     with the following principal contents: 
  </span>
</p>

<p>
  <span>
    1. Development viewpoints
  </span>
</p>

<p>
  <span>To realize general viewpoints of the strategy for and master plan on development of Vietnam’s tourism through 2020.
  </span>
</p>

<p>
  <span>PRIME MINISTER: Nguyen Tan Dung</span>
</p>

<p>
  <span>
    <b> PRIME MINISTER </b>
  </span>
</p>

<p>
  <b> <span> Article 2. </span> </b> 
  <span> 
     .................
  </span>
</p>

<p>
  <span> PRIME MINISTER: Nguyen Tan Dung</span>
</p>

As the expected output, I should get a list like this:

[ 
'Article 1.' , 
  'To approve the master plan on development of tourism in Northern 
   Central Vietnam with the following principal contents: ',
  '1. Development viewpoints' ,
  'To realize general viewpoints of the strategy for and master plan on 
   development of Vietnam’s tourism through 2020.' ,
  'PRIME MINISTER: Nguyen Tan Dung',
  'PRIME MINISTER'
]

The first item in the list is "Article 1." and the last item is the "PRIME MINISTER" inside the <b> tag.

Answer 1

"Until" and "between" queries are quite difficult in XPath, even in versions later than 1.0.

If we start with the later versions, in XPath 3.1 you can do the following:

(: >> and << compare single nodes, so each predicate must match exactly one p :)
let $first := p[contains(., 'Article 1')],
    $last := p[contains(., 'PRIME MINISTER')]
return ($first, p[. >> $first and . << $last], $last)

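In XPath 2.0 we don't have let, but for works just as well; it only looks a bit odd.

But in 1.0 (a) we can't bind variables, and (b) we don't have the << and >> operators, which makes it much harder.

The simplest expression is probably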

p[(.|preceding-sibling::p)[contains(., 'Article 1')] and 
  (.|following-sibling::p)[contains(., 'PRIME MINISTER')]]

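Unfortunately, unless the optimizer is incredibly smart, this is likely to be very inefficient on large input documents (the contains() test will be executed roughly (N^2)/2 times, where N is the number of paragraphs). If you're restricted to XPath 1.0, you may be better off using XPath to find the "start" and "end" nodes, and then using the host language to find all the nodes in between.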
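The host-language approach could be sketched as follows in Python. This is only an illustration, not part of the original answer: it assumes the standard-library xml.etree.ElementTree module (whose XPath subset has no contains(), so the string tests are done in Python), a shortened copy of the sample document, and two hypothetical helpers, text_of and paragraphs_between. It makes one pass over the paragraphs instead of a quadratic number of contains() tests.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed copy of the sample from the question (an assumption).
SAMPLE = """<html>
<p><b><span> Article 1. </span></b><span> To approve the master plan: </span></p>
<p><span> 1. Development viewpoints </span></p>
<p><span>PRIME MINISTER: Nguyen Tan Dung</span></p>
<p><span><b> PRIME MINISTER </b></span></p>
<p><b><span> Article 2. </span></b><span> ... </span></p>
<p><span> PRIME MINISTER: Nguyen Tan Dung</span></p>
</html>"""


def text_of(elem):
    """Return all text inside elem with whitespace collapsed."""
    return " ".join("".join(elem.itertext()).split())


def paragraphs_between(root, start_text, end_text):
    """Collect <p> children from the first one containing start_text up to
    and including the next one whose <b> descendant contains end_text.
    If no end paragraph is found, everything after the start is returned."""
    paragraphs = root.findall("p")  # simple path, within ElementTree's XPath subset
    start = next(
        (i for i, p in enumerate(paragraphs) if start_text in text_of(p)), None
    )
    if start is None:
        return []
    selected = []
    for p in paragraphs[start:]:
        selected.append(p)
        # Stop at the first paragraph that has the end marker inside a <b>.
        if any(end_text in text_of(b) for b in p.iter("b")):
            break
    return selected


root = ET.fromstring(SAMPLE)
result = [text_of(p) for p in paragraphs_between(root, "Article 1", "PRIME MINISTER")]
print(result)
```

With a full XPath 1.0 engine, the two marker lookups could be XPath queries instead, leaving only the "collect everything in between" loop to the host language.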

Answer 2

This XPath expression:

//p[descendant-or-self::p and (following-sibling::p/descendant::b)]

should produce the expected output, at least on the HTML you posted.

Answer 3

Here is the XPath that matches the exact requirement in the OP.

//span[normalize-space(.)='Article 1.']/ancestor::p|//p[//span[normalize-space(.)='Article 1.']]/following::*[count(following-sibling::p/span/b[normalize-space(.)='PRIME MINISTER'])=1]

Answer 4

A simple XPath 1.0 expression:

 /*/p[starts-with(normalize-space(), 'Article 1.')]
     [1]
    | /*/p[starts-with(normalize-space(), 'Article 1.')]
          [1]/following-sibling::p
             [not(preceding-sibling::p[starts-with(normalize-space(), 'PRIME MINISTER')])
             and
               following-sibling::p[starts-with(normalize-space(), 'PRIME MINISTER')]
             and not(starts-with(normalize-space(), 'PRIME MINISTER'))
             ]

Evaluated against this XML document:

<html>
<p>
  <b> <span> Article 1. </span> </b>
  <span>
     To approve the master plan on development
     of tourism in Northern Central Vietnam
     with the following principal contents:
  </span>
</p>

<p>
  <span>
    1. Development viewpoints
  </span>
</p>

<p>
  <span>To realize general viewpoints of the strategy for and master plan on development of Vietnam’s tourism through 2020.
  </span>
</p>

<p>
  <span>PRIME MINISTER: Nguyen Tan Dung</span>
</p>

<p>
  <span>
    <b> PRIME MINISTER </b>
  </span>
</p>

<p>
  <b> <span> Article 2. </span> </b>
  <span>
     .................
  </span>
</p>

<p>
  <span> PRIME MINISTER: Nguyen Tan Dung</span>
</p>
</html>

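It selects exactly the wanted <p> elements.

Verification:

This XSLT transformation evaluates the XPath expression and outputs all nodes selected in that evaluation: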

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

  <xsl:template match="/">
    <xsl:copy-of select=
    "/*/p[starts-with(normalize-space(), 'Article 1.')]
         [1]
        | /*/p[starts-with(normalize-space(), 'Article 1.')]
              [1]/following-sibling::p
                 [not(preceding-sibling::p[starts-with(normalize-space(), 'PRIME MINISTER')])
                 and
                   following-sibling::p[starts-with(normalize-space(), 'PRIME MINISTER')]
                 and not(starts-with(normalize-space(), 'PRIME MINISTER'))
                 ]
    "/>
  </xsl:template>
</xsl:stylesheet>

When applied to the same XML document (above), the wanted result is produced:

<p>
   <b>
      <span> Article 1. </span>
   </b>
   <span>
     To approve the master plan on development
     of tourism in Northern Central Vietnam
     with the following principal contents:
  </span>
</p>
<p>
   <span>
    1. Development viewpoints
  </span>
</p>
<p>
   <span>To realize general viewpoints of the strategy for and master plan on development of Vietnam’s tourism through 2020.
  </span>
</p>

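It displays in the browser as expected:

Article 1. To approve the master plan on development of tourism in Northern Central Vietnam with the following principal contents:

1. Development viewpoints

To realize general viewpoints of the strategy for and master plan on development of Vietnam’s tourism through 2020.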
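The question asks for a list of strings rather than <p> elements. A minimal sketch of that last step in Python, assuming the selected paragraphs have already been obtained and are wrapped in a hypothetical <div> root so they parse as one document; text_items is an illustrative helper, not part of any library. It uses the standard-library xml.etree.ElementTree module and emits one whitespace-normalized string per non-empty text node.

```python
import xml.etree.ElementTree as ET

# Two of the selected paragraphs, wrapped in a <div> purely so that they
# form a single well-formed document (the wrapper is an assumption).
SELECTED = """<div>
<p>
  <b> <span> Article 1. </span> </b>
  <span>
     To approve the master plan on development
     of tourism in Northern Central Vietnam
     with the following principal contents:
  </span>
</p>
<p>
  <span>
    1. Development viewpoints
  </span>
</p>
</div>"""


def text_items(paragraphs):
    """Return one normalized string per non-empty text node, in document order."""
    items = []
    for p in paragraphs:
        for chunk in p.itertext():
            cleaned = " ".join(chunk.split())  # same effect as normalize-space()
            if cleaned:
                items.append(cleaned)
    return items


selected_root = ET.fromstring(SELECTED)
items = text_items(selected_root.findall("p"))
print(items)
```

Splitting per text node is what keeps "Article 1." separate from the sentence that follows it in the same paragraph, as in the expected list in the question.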

XPath: get the tag containing a specific string and its following sibling tags, up to the tag containing another specific string
