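Can XPath locate a node that follows arbitrary text in invalid markup?

Time: 2012-06-25 14:14:50

Tags: html xpath screen-scraping web-scraping

I have a document written by a naughty web developer that looks like this: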

<div id="details">
    Here is some text without a p tag. Oh, let's write some more.
    <br>
    <br>
    And some more.
    <table id="non-unique">
        ...
    </table>
    Replaces the following numbers:
    <table id="non-unique">
        ... good stuff in here
    </table>
</div>

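So it is not marked up well. I need to get hold of the table with the good stuff in it, but it has no unique id value, it is not always in the same position, it is not always inside the div, and so on.

The only consistent pattern is that it always follows the text Replaces the following numbers:, although this text may appear bare as in the example above, or sometimes inside an h4 element!

Is it possible to address this table with an XPath expression by searching for that string and then asking for the next table element?

Thanks!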

3 answers:

Answer 0 (score: 1)

This seems to work for me:

//text()[contains(.,"Replaces the following numbers")]/following-sibling::table[1]

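The expression does not rely on id at all, so the duplicated id values do not get in the way. XPath itself does not require them to be unique, although HTML does.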

Answer 1 (score: 1)

Use:

//node()[self::h4 or self::text()]
         [normalize-space() = 'Replaces the following numbers:']
           /following-sibling::*[1][self::table]

XSLT-based verification:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="/">
  <xsl:copy-of select=
   "//node()[self::h4 or self::text()]
             [normalize-space() = 'Replaces the following numbers:']
               /following-sibling::*[1][self::table]
   "/>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied to the provided document (corrected to be a well-formed XML document):

<div id="details">
 Here is some text without a p tag. Oh, let's write some more.
    <br />
    <br />
    And some more.     
    <table id="non-unique">
     ...
  </table>
  Replaces the following numbers:
    <table id="non-unique">
    ... good stuff in here
    </table>
</div>

the XPath expression is evaluated and the selected node is copied to the output:

<table id="non-unique">
    ... good stuff in here
    </table>

When the same transformation (the same XPath expression) is applied to this XML document:

<div id="details">
 Here is some text without a p tag. Oh, let's write some more.
    <br />
    <br />
    And some more.     
    <table id="non-unique">
     ...
  </table>
  <h4>Replaces the following numbers:</h4>
    <table id="non-unique">
    ... good stuff in here
    </table>
</div>

the wanted element is again selected and output:

<table id="non-unique">
    ... good stuff in here
    </table>
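
For readers without an XSLT processor at hand, the logic of the expression can be mirrored by a manual DOM walk. The following is only a sketch using Python's standard-library xml.dom.minidom, not an XPath engine. The helper names (text_of, next_element_sibling, find_tables) and the two shortened sample documents are made up for illustration, and whitespace handling only approximates normalize-space().

```python
from xml.dom import minidom

MARKER = "Replaces the following numbers:"

# Shortened, well-formed versions of the two documents above (illustrative).
TEXT_VARIANT = """<div id="details">
    And some more.
    <table id="non-unique">...</table>
    Replaces the following numbers:
    <table id="non-unique">... good stuff in here</table>
</div>"""

H4_VARIANT = TEXT_VARIANT.replace(MARKER, "<h4>" + MARKER + "</h4>")


def text_of(node):
    """Concatenate descendant text, similar to XPath's string value."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_of(child) for child in node.childNodes)


def next_element_sibling(node):
    """Emulate following-sibling::*[1]: skip text and comment siblings."""
    sibling = node.nextSibling
    while sibling is not None and sibling.nodeType != sibling.ELEMENT_NODE:
        sibling = sibling.nextSibling
    return sibling


def find_tables(node):
    """Yield each table that is the first element after a matching node."""
    is_h4 = node.nodeType == node.ELEMENT_NODE and node.tagName == "h4"
    is_text = node.nodeType == node.TEXT_NODE
    # " ".join(s.split()) approximates normalize-space().
    if (is_h4 or is_text) and " ".join(text_of(node).split()) == MARKER:
        candidate = next_element_sibling(node)
        if candidate is not None and candidate.tagName == "table":
            yield candidate
    for child in node.childNodes:
        yield from find_tables(child)


for name, source in (("text", TEXT_VARIANT), ("h4", H4_VARIANT)):
    tables = list(find_tables(minidom.parseString(source)))
    print(name, [text_of(table).strip() for table in tables])
```

In the h4 variant the text node inside the h4 also matches the marker, but it has no following sibling, so only the h4 itself leads to the table. That is why the expression tests both self::h4 and self::text().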

Answer 2 (score: -1)

No, because XPath requires well-formed XML.

See this answer, which provides some additional information.
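
The well-formedness point can be checked directly: the markup from the question, with its bare <br> tags, is rejected by a strict XML parser. One possible workaround, sketched here with Python's standard-library html.parser, is to scan the tag soup leniently and capture the first table that starts after the marker text. The class name TableAfterMarker and the helper is_well_formed are hypothetical, the sample is a shortened copy of the question's markup, and this scan is looser than the XPath in answer 1 because it does not require the table to be the immediately following sibling.

```python
from html.parser import HTMLParser
from xml.dom import minidom
from xml.parsers.expat import ExpatError

MARKER = "Replaces the following numbers:"

# Shortened copy of the markup from the question, bare <br> tags included.
RAW_HTML = """<div id="details">
    Here is some text without a p tag.
    <br>
    <br>
    And some more.
    <table id="non-unique">...</table>
    Replaces the following numbers:
    <table id="non-unique">... good stuff in here</table>
</div>"""


def is_well_formed(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        minidom.parseString(markup)
        return True
    except ExpatError:
        return False


class TableAfterMarker(HTMLParser):
    """Collect the text of the first table that starts after MARKER."""

    def __init__(self):
        super().__init__()
        self.seen_marker = False
        self.depth = 0      # greater than 0 while inside the captured table
        self.done = False   # True once the captured table has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Start capturing at the first table after the marker; also count
        # nested tables so the matching end tag is found.
        if tag == "table" and not self.done and (self.seen_marker or self.depth):
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1
            self.done = self.depth == 0

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)
        elif not self.done and MARKER in " ".join(data.split()):
            self.seen_marker = True


print("well-formed XML:", is_well_formed(RAW_HTML))
parser = TableAfterMarker()
parser.feed(RAW_HTML)
parser.close()
print("captured:", "".join(parser.chunks).strip())
```

Because the scan looks only at text chunks, it finds the marker whether it is bare text or wrapped in an h4 element.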