Question

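I need some XSLT (or something else, see below) to replace the newlines in all attributes with an alternative character.

I have to process legacy XML that stores all of its data as attributes and uses newlines to express cardinality. For example: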

<sample>
    <p att="John
    Paul
    Ringo"></p>
</sample>

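When I parse the file in Java, these newlines are replaced with spaces (as the XML specification requires). I want to treat the values as a list, so this behaviour is not particularly useful.

My "solution" was to use XSLT to replace all newlines in all attributes with some other delimiter, but I have zero knowledge of XSLT. All the examples I have seen so far are either very specific or replace node content rather than attribute values.

I have dabbled with XSLT 2.0's replace(), but I am struggling to put everything together.

Is XSLT even the right solution? With the XSLT below: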

<xsl:template match="sample/*">
    <xsl:for-each select="@*">
        <xsl:value-of select="replace(current(), '\n', '|')"/>
    </xsl:for-each>
</xsl:template>

applied to the sample XML, Saxon outputs the following:

John Paul Ringo

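Obviously this format is not what I am after (it was only an experiment with replace()), but have the newlines already been normalised by the time XSLT processing starts? If so, is there any other way to parse these values as written using a Java parser? I have only used JAXB so far.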

Answer 1

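It seems this is hard to do. As I found in "Are line breaks in XML attribute values allowed?", newline characters in attributes are valid, but the XML parser normalises them (https://stackoverflow.com/a/8188290/1324394), so they are probably lost before any processing happens (and therefore before the replace).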
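To see the normalisation for yourself, here is a minimal sketch that parses a value like the one in the question with the JDK's built-in DOM parser. The class name `AttributeNormalizationDemo` and the helper `readAtt` are made up for this illustration; only standard `javax.xml.parsers` and `org.w3c.dom` calls are used.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class AttributeNormalizationDemo {

    /** Parses the XML and returns the "att" attribute of the first p element. */
    static String readAtt(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Element p = (Element) doc.getElementsByTagName("p").item(0);
            return p.getAttribute("att");
        } catch (Exception e) {
            // Hypothetical error handling, kept simple for the demo
            throw new IllegalStateException("Could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<sample>\n    <p att=\"John\n    Paul\n    Ringo\"></p>\n</sample>";
        String value = readAtt(xml);

        // The literal newlines were turned into spaces by the parser,
        // so no newline is left for XSLT (or anything else) to replace.
        System.out.println(value.contains("\n")); // prints "false"
    }
}
```

Any tool that sits on top of a conforming XML parser sees the same normalised value, which is why the replace() call in the question finds nothing to match.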

Answer 2

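I solved (or rather worked around) this by pre-processing the XML with JSoup, which is a nod to @Ian Roberts' comment about parsing XML with a non-XML tool. JSoup is (or was) designed for HTML documents, but it works well in this context.

My code is as follows: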

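// Imports assumed for this snippet: JUnit 4, Commons IO, Commons Lang 3 and jsoup
import java.io.IOException;
import java.util.List;

import org.apache.commons.io.FileUtils;
import org.apache.commons.lang3.StringUtils;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Node;
import org.jsoup.parser.Parser;
import org.junit.Test;

// sourcePath (the location of the XML file) is a field of the enclosing test class
@Test
public void verifyNewlineEscaping() throws IOException {
    final List<Node> nodes = Parser.parseXmlFragment(
            FileUtils.readFileToString(sourcePath.toFile(), "UTF-8"), "");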

    fixAttributeNewlines(nodes);

    // Reconstruct XML
    StringBuilder output = new StringBuilder();
    for (Node node : nodes) {
        output.append(node.toString());
    }

    // Print cleansed output to stdout
    System.out.println(output);
}

/**
 * Replace newlines and surrounding whitespace in XML attributes with an alternative delimiter in
 * order to avoid whitespace normalisation converting newlines to a single space.
 * 
 * <p>
 * This is useful if newlines which have semantic value have been incorrectly inserted into
 * attribute values.
 * </p>
 * 
 * @param nodes nodes to update
 */
private static void fixAttributeNewlines(final List<Node> nodes) {

    /*
     * Recursively iterate over all attributes in all nodes in the XML document, performing
     * attribute string replacement
     */
    for (final Node node : nodes) {
        final List<Attribute> attributes = node.attributes().asList();

        for (final Attribute attribute : attributes) {

            // JSoup reports whitespace as attributes
            if (!StringUtils.isWhitespace(attribute.getValue())) {
                attribute.setValue(attribute.getValue().replaceAll("\\s*\r?\n\\s*", "|"));
            }
        }

        // Recursively process child nodes
        if (!node.childNodes().isEmpty()) {
            fixAttributeNewlines(node.childNodes());
        }
    }
}

For the sample XML in my question, the output of this method is:

<sample> 
    <p att="John|Paul|Ringo"></p> 
</sample>

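Note that I did not use &#10; as the delimiter, because JSoup is rather vigilant in its character escaping and escapes everything in attribute values. It also replaces existing numeric entity references with their UTF-8 equivalents, so time will tell whether this is a passable solution.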

Answer 3

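XSLT only gets to see the XML after the XML parser has processed it, by which point attribute value normalisation has already been done.

I believe some XML parsers have an option to suppress attribute value normalisation. If you do not have access to such a parser, I think that doing a textual replacement of each newline (\r?\n) with &#10; before parsing is probably your best escape route. Newlines escaped this way are not clobbered by attribute value normalisation.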
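A rough sketch of that pre-parse replacement in Java might look like the following. It is only an illustration under some assumptions: the class and method names (`AttributeNewlineEscaper`, `escapeAttributeNewlines`, `readList`) are made up, and the regex treats any quoted run that contains no `<` as an attribute value, so quoted text inside comments or CDATA sections would be touched as well. For legacy files that keep all their data in attributes that may be acceptable, but check it against your own input.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class AttributeNewlineEscaper {

    // Assumption: a quoted run without '<' is an attribute value
    // (well-formed attribute values cannot contain a literal '<').
    private static final Pattern QUOTED = Pattern.compile("\"[^\"<]*\"|'[^'<]*'");

    /** Replaces literal newlines inside quoted values with the character reference &#10;. */
    static String escapeAttributeNewlines(String xml) {
        Matcher m = QUOTED.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String escaped = m.group().replaceAll("\\r?\\n", "&#10;");
            m.appendReplacement(out, Matcher.quoteReplacement(escaped));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Parses the escaped XML and splits the "att" attribute of the first p element into items. */
    static String[] readList(String escapedXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(escapedXml)));
            Element p = (Element) doc.getElementsByTagName("p").item(0);
            // A newline written as &#10; survives normalisation, so we can split on it
            String[] items = p.getAttribute("att").split("\n");
            for (int i = 0; i < items.length; i++) {
                items[i] = items[i].trim(); // drop the indentation around each item
            }
            return items;
        } catch (Exception e) {
            // Hypothetical error handling, kept simple for the sketch
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<sample>\n    <p att=\"John\n    Paul\n    Ringo\"></p>\n</sample>";
        for (String name : readList(escapeAttributeNewlines(xml))) {
            System.out.println(name);
        }
    }
}
```

The replacement is restricted to quoted values because a character reference is not allowed between attributes inside a tag or outside the root element, so blindly replacing every newline in the file would produce XML that is not well-formed.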

Replacing newlines in XML attributes with XSLT
