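Remove all HTML tags except allowed ones using an XSLT function

Time: 2017-03-10 11:02:54

Tags: xml xslt replace strip-tags

I am trying to clean up some data fetched from an RSS feed using XSLT. I want to remove all tags except the p tag.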

 Cows are kool.<p>The <i>milk</i> <b>costs</b> $1.99.</p>

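I have a few questions about how to solve this with XSLT 1.0 or 2.0.

1) I have seen this example: https://maulikdhorajia.blogspot.in/2011/06/removing-html-tags-using-xslt.html

But I need the p tags to be kept, so I would need a regular expression. Could we use something like a "string-before-match" function and proceed in a similar way? I don't think such a function exists in XPath.

2) I know the replace function cannot be used for this, because it expects a string. If we pass it a node, the node's text content is extracted and passed to the function, so there are no tags left to remove.

I am confused because replace is used in this answer: https://stackoverflow.com/a/18528749/745018

3) I am doing this with XSLT on an nginx server.

Please find the sample input below; it comes from the body tag of the RSS feed.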

<p>The Supreme Court issued on Friday a bailable warrant against sitting Calcutta high court justice CS Karnan, an unprecedented order in a bitter confrontation between the judge and the top court.</p><p>A seven-judge bench headed by Chief Justice of India JS Khehar issued the order directing Karnan’s presence on <h2>March 31</h2> because the judge ignored an earlier court order summoning him.<i>Justice Karnan</i> had to appear</p>

Update: I am also looking for an XSLT function for this.

1 Answer:

Answer 0: (score: 4)

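Assuming you can use XSLT 2.0, you could apply David Carlisle's HTML parser (https://github.com/davidcarlisle/web-xslt/blob/master/htmlparse/htmlparse.xsl) to the content of the body element and then process the resulting nodes in a mode that strips every element except the p elements: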

<?xml version="1.0" encoding="UTF-8"?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
    xmlns:d="data:,dpc"
    xmlns:xhtml="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="d xhtml">

    <xsl:import href="htmlparse-by-dcarlisle.xsl"/>

    <xsl:template match="@*|node()" mode="#default strip">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()" mode="#current"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="body">
        <xsl:copy>
            <xsl:apply-templates select="d:htmlparse(., '', true())" mode="strip"/>
        </xsl:copy>
    </xsl:template>

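    <!-- In strip mode, unwrap every element except p: keep its content, drop its tags -->
    <xsl:template match="*[not(self::p)]" mode="strip">
        <!-- Stay in the current (strip) mode so that nested elements are stripped too -->
        <xsl:apply-templates mode="#current"/>
    </xsl:template>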

</xsl:transform>

With an input of

<rss>
    <entry>
        <body><![CDATA[<p>The Supreme Court issued on Friday a bailable warrant against sitting Calcutta high court justice CS Karnan, an unprecedented order in a bitter confrontation between the judge and the top court.</p><p>A seven-judge bench headed by Chief Justice of India JS Khehar issued the order directing Karnan’s presence on <h2>March 31</h2> because the judge ignored an earlier court order summoning him.<i>Justice Karnan</i> had to appear</p>]]></body>
    </entry>
</rss>

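If the input were not escaped but included as XML in the input document, you would not need to parse it and could simply apply the mode to the content. For the escaped input above, the stylesheet gives: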

<rss>
    <entry>
        <body><p>The Supreme Court issued on Friday a bailable warrant against sitting Calcutta high court justice CS Karnan, an unprecedented order in a bitter confrontation between the judge and the top court.</p><p>A seven-judge bench headed by Chief Justice of India JS Khehar issued the order directing Karnan’s presence on March 31 because the judge ignored an earlier court order summoning him.Justice Karnan had to appear</p></body>
    </entry>
</rss>
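
For the unescaped case, a minimal sketch could look like the following. It assumes the markup sits inside body as real child elements in no namespace, so the import of the HTML parser and the d:htmlparse call are not needed; the rest mirrors the stylesheet above and is an illustration rather than a tested drop-in.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

    <!-- Identity transformation, shared by the default mode and the strip mode -->
    <xsl:template match="@*|node()" mode="#default strip">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()" mode="#current"/>
        </xsl:copy>
    </xsl:template>

    <!-- Copy body and process its existing child nodes in strip mode (no parsing step) -->
    <xsl:template match="body">
        <xsl:copy>
            <xsl:apply-templates select="@*"/>
            <xsl:apply-templates select="node()" mode="strip"/>
        </xsl:copy>
    </xsl:template>

    <!-- In strip mode, unwrap every element except p: keep its content, drop its tags -->
    <xsl:template match="*[not(self::p)]" mode="strip">
        <xsl:apply-templates mode="#current"/>
    </xsl:template>

</xsl:transform>
```

Because the unwrapping template only applies templates to child nodes, attributes of the removed elements are dropped along with their tags, while p elements keep theirs through the identity template.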

http://xsltransform.net/gWEamMc/1
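
Since the question mentions nginx, whose XSLT module is built on libxslt (an XSLT 1.0 processor), here is a hedged XSLT 1.0 sketch as well. It only covers the case where the content of body is unescaped XML; escaped or CDATA content would still have to be parsed first, which the XSLT 2.0 parser above does and plain XSLT 1.0 cannot easily do. The match pattern is my own choice for illustration, not part of the original answer.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

    <!-- Identity transformation: copy everything unchanged by default -->
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <!-- Inside body, unwrap every element that is not a p: keep content, drop tags -->
    <xsl:template match="body//*[not(self::p)]">
        <xsl:apply-templates/>
    </xsl:template>

</xsl:stylesheet>
```

No modes are needed here because the pattern itself restricts the unwrapping to descendants of body; elements elsewhere in the feed are copied unchanged.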