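XPath: excluding multiple elements/tags

Asked: 2013-04-20 02:53:47

Tags: xpath html-parsing

I'm having trouble extracting the text that sits between two div tags in an XML document.

Suppose I have the following XML: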

<div class="default_style_wrap" >

<!-- Body starts -->
    <!-- Irrelevant Data -->
    <div style="clear:both" />
    <!-- Irrelevant Data -->
    <div class="name_address" >...</div>
    <!-- Irrelevant Data -->
    <div style="clear:both" />
    <!-- Irrelevant Data -->
    <span class="img_comments_right" >...</span>

    <!-- Text that I want to get -->
Two members of the Expedition 35 crew wrapped up a 6-hour, 38 minute spacewalk at 4:41 p.m. EDT Friday to deploy and retrieve several science experiments on the exterior of the International Space Station and install a new navigational aid.
    <br />
    <br />
The spacewalkers' first task was to install the Obstanovka experiment on the station's Zvezda service module. Obstanovka will study plasma waves and the effect of space weather on Earth's ionosphere.

    <!-- Irrelevant Data Again -->
    <span class="img_comments_right" >...</span>
    <!-- Text that I want to get -->
After deploying a pair of sensor booms for Obstanovka, Vinogradov and Romanenko retrieved the Biorisk experiment from the exterior of Pirs. The Biorisk experiment studied the effect of microbes on spacecraft structures.
    <br />
    <br />
This was the 167th spacewalk in support of space station assembly and maintenance, totaling 1,055 hours, 39 minutes. Vinogradov's seven spacewalks total 38 hours, 25 minutes. Romanenko completed his first spacewalk.
    <!-- Body ends -->
</div>

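The snippet may not make this obvious, but default_style_wrap is the parent of all the other irrelevant divs and spans. The text I want is essentially all the bare text that is not wrapped in a tag, but as you can see there are other tags mixed in, such as img_comments_right, and that is driving me crazy.

I saw the following in another post:

"//div[@class='article_container']/*[not(self::div)]";

But it doesn't seem to return any text at all, and even if it did, I wouldn't know how to exclude the spans.

Any ideas?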

4 answers:

Answer 0 (score: 0)

You should try the following query. It selects every following sibling of the <span> nodes that is a text node:

query = '//span[@class="img_comments_right"]/following-sibling::text()';
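What following-sibling::text() picks up can be sketched in Python. This is only an approximation under stated assumptions: the standard-library ElementTree does not support the following-sibling axis or the text() node test, so the sketch walks the children by hand and reads each element's tail. The SAMPLE markup and the helper name text_after_spans are hypothetical, and whitespace-only nodes are dropped for readability even though the XPath query would return them.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for the markup in the question.
SAMPLE = """<div class="default_style_wrap">
    <div class="name_address">address</div>
    Text before the first span.
    <span class="img_comments_right">caption 1</span>
    First paragraph.
    <br />
    Second paragraph.
    <span class="img_comments_right">caption 2</span>
    Third paragraph.
</div>"""


def text_after_spans(parent, span_class="img_comments_right"):
    """Collect the text that follows the first matching <span> among parent's children."""
    seen_span = False
    nodes = []
    for child in parent:
        if child.tag == "span" and child.get("class") == span_class:
            seen_span = True
        # An element's tail is the text sitting directly after it,
        # i.e. a following-sibling text node of that element.
        if seen_span and child.tail and child.tail.strip():
            nodes.append(child.tail.strip())
    return nodes


root = ET.fromstring(SAMPLE)
after_spans = text_after_spans(root)
print(after_spans)
```

Note that text placed before the first matching span is not a following sibling of any span, so it is left out.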

Answer 1 (score: 0)

You can use this XPath:

//div[@class='default_style_wrap']/text()
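As a rough illustration of which nodes /text() returns, here is a sketch using Python's standard-library ElementTree. Its XPath subset has no text() node test, so the direct text children are rebuilt from the parent's text and each child's tail; the SAMPLE markup and the helper name child_text_nodes are assumptions made for the example. ElementTree also discards comments by default, so text on either side of a comment would be merged rather than split as in XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for the markup in the question.
SAMPLE = """<div class="default_style_wrap">
    <div style="clear:both" />
    <div class="name_address">address</div>
    <span class="img_comments_right">caption 1</span>
    First paragraph.
    <br />
    <br />
    Second paragraph.
    <span class="img_comments_right">caption 2</span>
    Third paragraph.
</div>"""


def child_text_nodes(parent):
    """Return the text that is a direct child of parent, in document order."""
    nodes = []
    if parent.text is not None:
        nodes.append(parent.text)       # text before the first child element
    for child in parent:
        if child.tail is not None:
            nodes.append(child.tail)    # text directly after each child element
    return nodes


root = ET.fromstring(SAMPLE)            # the root here is the wrapper div itself
raw_texts = child_text_nodes(root)
non_blank = [t.strip() for t in raw_texts if t.strip()]
print(len(raw_texts), non_blank)
```

The text inside the nested div and span elements is not included, but whitespace-only nodes between the tags are, so some filtering is usually still needed.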

Answer 2 (score: 0)

You should be able to get the text with this XPath:

div[@class = 'default_style_wrap']/text()[normalize-space()]

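It selects all text() nodes that are children of the default_style_wrap <div>, filtering out empty (or whitespace-only) nodes.

If you use separate templates, you can put each block neatly into its own paragraph, for example: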

<xsl:template match="/">
    <xsl:apply-templates select="div[@class = 'default_style_wrap']/text()[normalize-space()]" />
</xsl:template>

<xsl:template match="text()">
    <p><xsl:value-of select="." /></p>
</xsl:template>
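For readers without an XSLT processor at hand, the same idea can be approximated in Python. This is a sketch, not a translation of the stylesheet: it uses the standard-library ElementTree, rebuilds the text children from text and tail, skips whitespace-only chunks in the spirit of the [normalize-space()] predicate, and wraps each remaining chunk in <p>. The SAMPLE markup and the helper name paragraphs are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical, trimmed-down stand-in for the markup in the question.
SAMPLE = """<div class="default_style_wrap">
    <div class="name_address">address</div>
    <span class="img_comments_right">caption 1</span>
    First paragraph.
    <br />
    <br />
    Second paragraph.
    <span class="img_comments_right">caption 2</span>
    Third paragraph.
</div>"""


def paragraphs(parent):
    """Wrap each non-blank text child of parent in a <p> element (as a string)."""
    chunks = [parent.text] + [child.tail for child in parent]
    result = []
    for chunk in chunks:
        # Skip missing and whitespace-only chunks, like [normalize-space()].
        if chunk and chunk.strip():
            # strip() trims surrounding whitespace for tidier output;
            # xsl:value-of in the stylesheet above would keep it.
            result.append("<p>%s</p>" % escape(chunk.strip()))
    return result


root = ET.fromstring(SAMPLE)
html_paragraphs = paragraphs(root)
print("\n".join(html_paragraphs))
```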

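Answer 3 (score: 0)

Solution:

You can use the or operator to give the not operator multiple conditions, like this:

not(expr1 or expr2)

So you can add self::span as another condition inside not to exclude spans from the result: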

//div[@class='default_style_wrap']/*[not(self::div or self::span)]
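To see what this expression selects, here is a small Python sketch. ElementTree's XPath subset does not support the self:: axis, so the filter is written in plain Python instead; the SAMPLE markup is a hypothetical stand-in for the question's document. Keep in mind that * matches element nodes only, so the result consists of the remaining child elements (the br elements in this sample) rather than the bare text between them.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for the markup in the question.
SAMPLE = """<div class="default_style_wrap">
    <div style="clear:both" />
    <div class="name_address">address</div>
    <span class="img_comments_right">caption 1</span>
    First paragraph.
    <br />
    <br />
    Second paragraph.
</div>"""

root = ET.fromstring(SAMPLE)

# Python equivalent of *[not(self::div or self::span)]:
# keep every child element whose tag is neither div nor span.
kept = [child for child in root if child.tag not in ("div", "span")]
kept_tags = [child.tag for child in kept]
print(kept_tags)
```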

PS: There seems to be a problem with div tags that are not closed properly. Close them correctly.