Question

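I'm trying to feed some XML to Apache Solr, but some of the XML contains HTML formatting inside the text, and that prevents me from posting it to my Solr server. Obviously it would be nice to keep this information, since my documents could then be pre-formatted before publishing, but I haven't seen or figured out whether escaping would avoid Solr's problem with the HTML. My question is: how do I use XSLT to strip the HTML out of the XML?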

For example:

What I have:

<field name="description"><h1>This is a description of a doc!</h1><p> This doc contains some information</p></field>

What I need:

<field name="description">This is a description of a doc! This doc contains some information.</field>

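I'm hoping there is a smart fix rather than a blacklist of specific tags to erase during the XSL transformation. A blacklist is fragile: if someone creates a new document that uses a tag not yet on the list, that tag will slip through until a programmer adds it by hand.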

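I have already tried converting the HTML tags to HTML entities (&lt; and &gt; for < and > respectively), but that only made things worse when I tried to post the content through HtmlPost using BasicNameValuePairs. I would rather not use these entities.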

Any ideas, StackOverflow?

Answer 1

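If you know which element contains the HTML, you can match every descendant of that element that is not a text node and simply do an apply-templates. The markup is dropped and only the text is kept, whatever the tag names are, so no blacklist is needed.

Example...

XML Input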

<field name="description"><h1>This is a <b>description</b> of a doc!</h1><!--Here's a comment--><p> This doc contains some information</p></field>

XSLT 1.0

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" indent="yes"/>

    <xsl:template match="node()|@*">
        <xsl:copy>
            <xsl:apply-templates select="node()|@*"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="node()[ancestor::field and not(self::text())]">
        <xsl:apply-templates/>
    </xsl:template>

</xsl:stylesheet>

XML Output

<field name="description">This is a description of a doc! This doc contains some information</field>
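
Since the question mentions BasicNameValuePairs, the client is presumably written in Java. As a rough sketch only, the stylesheet above could be applied before posting by using the XSLT 1.0 processor bundled with the JDK (javax.xml.transform). The class name StripFieldMarkup and the method strip are hypothetical names chosen for illustration, and the stylesheet is embedded without the xsl:output indent setting so the result stays on one line.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper class: not part of Solr or of the original answer.
public class StripFieldMarkup {

    // The two templates from the answer: an identity template, plus a
    // template that unwraps every non-text node below a <field> element.
    static final String XSLT =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:template match=\"node()|@*\">"
      + "<xsl:copy><xsl:apply-templates select=\"node()|@*\"/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match=\"node()[ancestor::field and not(self::text())]\">"
      + "<xsl:apply-templates/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Runs the stylesheet over the given XML string and returns the result.
    static String strip(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            // Leave out the <?xml ...?> declaration so only the element is returned.
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Wrap the checked exception so callers need no throws clause.
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String input =
            "<field name=\"description\"><h1>This is a <b>description</b> of a doc!</h1>"
          + "<!--Here's a comment--><p> This doc contains some information</p></field>";
        System.out.println(strip(input));
    }
}
```

The string returned by strip could then be sent to Solr in place of the original XML. In real code the compiled stylesheet would normally be cached (for example as a javax.xml.transform.Templates object) rather than rebuilt on every call.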
