XPath: get a tag that contains a specific string, plus its following sibling tags, until a tag that contains another specific string

Time: 2019-06-09 20:30:21

Tags: python xpath web-scraping scrapy

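I'm very new to XPath. I'm trying to extract some information from a website of laws and regulations, and for now I just want to:

  1. Find a tag that contains the string "Article 1".
  2. Starting from the tag found in (1), take that tag and everything after it, up to and including the tag that contains the other string, "PRIME MINISTER", inside a <b> tag.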
<p>
  <b> <span> Article 1. </span> </b> 
  <span> 
     To approve the master plan on development 
     of tourism in Northern Central Vietnam 
     with the following principal contents: 
  </span>
</p>

<p>
  <span>
    1. Development viewpoints
  </span>
</p>

<p>
  <span>To realize general viewpoints of the strategy for and master plan on development of Vietnam’s tourism through 2020.
  </span>
</p>

<p>
  <span>PRIME MINISTER: Nguyen Tan Dung</span>
</p>

<p>
  <span>
    <b> PRIME MINISTER </b>
  </span>
</p>

<p>
  <b> <span> Article 2. </span> </b> 
  <span> 
     .................
  </span>
</p>

<p>
  <span> PRIME MINISTER: Nguyen Tan Dung</span>
</p>

For the expected output, I should get a list like this:

[ 
'Article 1.' , 
  'To approve the master plan on development of tourism in Northern 
   Central Vietnam with the following principal contents: ',
  '1. Development viewpoints' ,
  'To realize general viewpoints of the strategy for and master plan on 
   development of Vietnam’s tourism through 2020.' ,
  'PRIME MINISTER: Nguyen Tan Dung',
  'PRIME MINISTER'
]

The first item in the list is "Article 1." and the last item is the "PRIME MINISTER" inside the <b> tag.

4 Answers:

Answer 0 (score: 3)

"Until" and "between" queries are quite difficult in XPath, even in versions later than 1.0.

If we start with the later versions, you can do the following in XPath 3.1:

let $first := p[contains(., 'Article 1')],
    $last := p[contains(., 'PRIME MINISTER')]
return ($first, p[. >> $first and . << $last], $last)

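In XPath 2.0 we don't have let, but for works just as well; it just reads a little oddly.

But in 1.0 (a) we can't bind variables, and (b) we don't have the << and >> operators, which makes it considerably harder.

The simplest expression is probably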

p[(.|preceding-sibling::p)[contains(., 'Article 1')] and 
  (.|following-sibling::p)[contains(., 'PRIME MINISTER')]]

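Unfortunately, unless the optimizer is extremely smart, this can be very inefficient for large input documents: the contains() test will be executed roughly (N^2)/2 times, where N is the number of paragraphs. If you are restricted to XPath 1.0, you are better off using XPath to find the "start" and "end" nodes, and then using the host language to find all the nodes in between.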
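As a rough illustration of the "let the host language do the in-between part" advice, here is a minimal Python sketch. It assumes well-formed markup and uses only the standard library's xml.etree.ElementTree, whose XPath subset has no contains(), so the string tests are done in Python instead. The names SAMPLE, norm and texts_between are made up for this example, and the sample is a shortened version of the HTML in the question. The end condition follows the question (a <p> with a <b> containing the end string), not the simpler contains() test used in the XPath above.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed version of the HTML from the question (assumption:
# real pages may need a lenient HTML parser instead of ElementTree).
SAMPLE = """<html>
<p><b><span> Article 1. </span></b>
  <span> To approve the master plan: </span></p>
<p><span> 1. Development viewpoints </span></p>
<p><span>PRIME MINISTER: Nguyen Tan Dung</span></p>
<p><span><b> PRIME MINISTER </b></span></p>
<p><b><span> Article 2. </span></b><span> ... </span></p>
<p><span> PRIME MINISTER: Nguyen Tan Dung</span></p>
</html>"""


def norm(text):
    """Collapse runs of whitespace, like XPath's normalize-space()."""
    return " ".join(text.split())


def texts_between(root, start_text, end_text):
    """Collect text pieces from the first <p> containing start_text up to and
    including the first later <p> that has a <b> containing end_text."""
    result = []
    started = False
    for p in root.iter("p"):  # single pass, document order
        if not started:
            if start_text not in norm("".join(p.itertext())):
                continue
            started = True
        # Keep every non-empty text piece of this paragraph.
        result.extend(t for t in map(norm, p.itertext()) if t)
        # Stop once a <b> inside this paragraph holds the end string.
        if any(end_text in norm("".join(b.itertext())) for b in p.iter("b")):
            break
    return result


items = texts_between(ET.fromstring(SAMPLE), "Article 1", "PRIME MINISTER")
print(items)
```

Each paragraph is visited once, so the work grows linearly with the number of paragraphs rather than quadratically. On the shortened sample the list starts with 'Article 1.' and ends with the 'PRIME MINISTER' taken from the <b> tag.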

Answer 1 (score: 0)

This XPath expression:

//p[descendant-or-self::p and (following-sibling::p/descendant::b)]

should produce the expected output, at least on the HTML you posted.

Answer 2 (score: 0)

Here is an XPath that matches the exact requirement in the OP.

//span[normalize-space(.)='Article 1.']/ancestor::p|//p[//span[normalize-space(.)='Article 1.']]/following::*[count(following-sibling::p/span/b[normalize-space(.)='PRIME MINISTER'])=1]

Answer 3 (score: 0)

A simple XPath 1.0 expression:

 /*/p[starts-with(normalize-space(), 'Article 1.')]
     [1]
    | /*/p[starts-with(normalize-space(), 'Article 1.')]
          [1]/following-sibling::p
             [not(preceding-sibling::p[starts-with(normalize-space(), 'PRIME MINISTER')])
             and
               following-sibling::p[starts-with(normalize-space(), 'PRIME MINISTER')]
             and not(starts-with(normalize-space(), 'PRIME MINISTER'))
             ]

when evaluated against this XML document:

<html>
<p>
  <b> <span> Article 1. </span> </b>
  <span>
     To approve the master plan on development
     of tourism in Northern Central Vietnam
     with the following principal contents:
  </span>
</p>

<p>
  <span>
    1. Development viewpoints
  </span>
</p>

<p>
  <span>To realize general viewpoints of the strategy for and master plan on development of Vietnam’s tourism through 2020.
  </span>
</p>

<p>
  <span>PRIME MINISTER: Nguyen Tan Dung</span>
</p>

<p>
  <span>
    <b> PRIME MINISTER </b>
  </span>
</p>

<p>
  <b> <span> Article 2. </span> </b>
  <span>
     .................
  </span>
</p>

<p>
  <span> PRIME MINISTER: Nguyen Tan Dung</span>
</p>
</html>

selects exactly the wanted <p> elements.
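To make the three predicates easier to follow, they can be mirrored one-for-one in plain Python. This is only a sketch for reasoning about the expression, not a replacement for it: it uses the standard library's xml.etree.ElementTree, the names DOC, string_value and select_like_xpath are invented here, and DOC is a shortened version of the document above.

```python
import xml.etree.ElementTree as ET

# Shortened version of the XML document above (illustrative only).
DOC = """<html>
<p><b><span> Article 1. </span></b><span> To approve the master plan: </span></p>
<p><span> 1. Development viewpoints </span></p>
<p><span>PRIME MINISTER: Nguyen Tan Dung</span></p>
<p><span><b> PRIME MINISTER </b></span></p>
<p><b><span> Article 2. </span></b><span> ... </span></p>
</html>"""


def string_value(elem):
    """Whitespace-normalized string value, like normalize-space() on the element."""
    return " ".join("".join(elem.itertext()).split())


def select_like_xpath(root, start_prefix, stop_prefix):
    paras = root.findall("p")                      # /*/p
    texts = [string_value(p) for p in paras]
    is_stop = [t.startswith(stop_prefix) for t in texts]
    # [starts-with(normalize-space(), start_prefix)][1]
    first = next((i for i, t in enumerate(texts) if t.startswith(start_prefix)), None)
    if first is None:
        return []
    picked = [first]
    for i in range(first + 1, len(paras)):         # following-sibling::p
        if (not any(is_stop[:i])                   # no earlier sibling is a stop paragraph
                and any(is_stop[i + 1:])           # some later sibling is a stop paragraph
                and not is_stop[i]):               # this paragraph is not one itself
            picked.append(i)
    return [texts[i] for i in picked]


selected = select_like_xpath(ET.fromstring(DOC), "Article 1.", "PRIME MINISTER")
print(selected)
```

Read this way, the selection runs from the first "Article 1." paragraph up to, but not including, the first paragraph that starts with "PRIME MINISTER".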

Verification

This XSLT transformation evaluates the XPath expression and outputs all nodes selected by that evaluation:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

  <xsl:template match="/">
    <xsl:copy-of select=
    "/*/p[starts-with(normalize-space(), 'Article 1.')]
         [1]
        | /*/p[starts-with(normalize-space(), 'Article 1.')]
              [1]/following-sibling::p
                 [not(preceding-sibling::p[starts-with(normalize-space(), 'PRIME MINISTER')])
                 and
                   following-sibling::p[starts-with(normalize-space(), 'PRIME MINISTER')]
                 and not(starts-with(normalize-space(), 'PRIME MINISTER'))
                 ]
    "/>
  </xsl:template>
</xsl:stylesheet>

When applied to the same XML document (above), it produces the wanted result:

<p>
   <b>
      <span> Article 1. </span>
   </b>
   <span>
     To approve the master plan on development
     of tourism in Northern Central Vietnam
     with the following principal contents:
  </span>
</p>
<p>
   <span>
    1. Development viewpoints
  </span>
</p>
<p>
   <span>To realize general viewpoints of the strategy for and master plan on development of Vietnam’s tourism through 2020.
  </span>
</p>

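In a browser, this renders as expected:

Article 1. To approve the master plan on development of tourism in Northern Central Vietnam with the following principal contents:

1. Development viewpoints

To realize general viewpoints of the strategy for and master plan on development of Vietnam’s tourism through 2020.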