Question

I have an XML file that contains many nested topic elements. For example:

<?xml version="1.0" encoding="UTF-8"?>
<topic id="topic-1">
    <title>ADBT</title>

    <para>The program executes a database request by using the ADBT
        library. The ADBT library prepares
        the request and calls an ODBC driver
        or a native API.  
    </para>

    <topic id="topic_wom_eqy_ev">
        <title>Establishing a connection</title>
        <para>
            In order to use a database with ADBT, the first step to be taken
            is
            to establish a
            connection.
        </para>

    </topic>
    <topic id="topic_dsw_gqy_ev">
        <title>Querying a database</title>
        <para>Querying a database involves a number of stages.</para>
        <topic id="topic_ljf_isy_ev">
            <title>Stage one: create a query</title>
            <para> A new query (ADBT_Select object) can only be created starting
                from a previously
                established connection. A query is created using
                the CreateSelect method in two
                different
                ways:
            </para>
        </topic>
    </topic>

</topic>

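I want to split each topic into its own XML file, named after the topic's title. If a topic contains another topic, the child topic goes into a separate file, and the parent topic goes into its own file whose content does not include the child. In this case there would be four output files, with the following content: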

File 1:

<topic id="topic-1">
    <title>ADBT</title>

    <para>The program executes a database request by using the ADBT
        library. The ADBT library prepares
        the request and calls an ODBC driver or a native API.
    </para>
</topic>

File 2:

<topic id="topic_wom_eqy_ev">
    <title>Establishing a connection</title>
    <para>
        In order to use a database with ADBT, the first step to be taken is
        to establish a
        connection.
    </para>

</topic>

File 3:

<topic id="topic_dsw_gqy_ev">
    <title>Querying a database</title>
    <para>Querying a database involves a number of stages.</para>
</topic>

File 4:

<topic id="topic_ljf_isy_ev">
    <title>Stage one: create a query</title>
    <para> A new query (ADBT_Select object) can only be created starting
        from a previously
        established connection. A query is created using the CreateSelect method in two
        different
        ways:
    </para>
</topic>

I have written a few functions, but I cannot figure out how to separate topics that are nested several levels deep.

Answer 1

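Basically, what you want to do is:

Read the XML with an XML reader of your choice.
Collect all <topic> elements in the document, recursively.
For each <topic> element, create a copy of it (for example in a new document whose root is that <topic> element) and copy over all of the original's children except those whose tagName is topic. This ensures the documents produced for nested topics do not overlap.
Serialize each document created this way to a file with an XML writer of your choice.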

So, in schematic code:

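// Schematic only: the helper methods are placeholders, not library calls.
Document document = readXMLDocument(...);
// Every <topic> element at any nesting depth
List<Element> topicElements = readTopicElementsRecursively(document);
List<Document> splitTopicDocuments = new ArrayList<>();
for (Element el : topicElements) {
    // One new document per topic, leaving out its child <topic> elements
    Document doc = copyElementWithoutTopicChildren(el);
    splitTopicDocuments.add(doc);
}
// One output file per document
writeTopicDocuments(splitTopicDocuments);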
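
The schematic code leaves its helpers undefined, so here is one way they might be filled in using only the JDK's built-in DOM and JAXP classes. This is a minimal sketch, not the answerer's implementation: the class name TopicSplitter and the methods split, copyWithoutTopicChildren, titleOf and toFileName are made up for illustration. It assumes every topic has a unique title, and it replaces characters that are not valid in file names (such as the colon in "Stage one: create a query") with an underscore.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class; none of these names come from the original answer.
final class TopicSplitter {

    /** Maps a file name derived from each topic's title to that topic's XML, nested topics excluded. */
    static Map<String, String> split(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document source = builder.parse(new InputSource(new StringReader(xml)));
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            Map<String, String> files = new LinkedHashMap<>();
            // getElementsByTagName returns every <topic> at any depth, in document order.
            NodeList topics = source.getElementsByTagName("topic");
            for (int i = 0; i < topics.getLength(); i++) {
                Element topic = (Element) topics.item(i);
                Document doc = copyWithoutTopicChildren(builder, topic);
                StringWriter out = new StringWriter();
                serializer.transform(new DOMSource(doc), new StreamResult(out));
                // Assumption: titles are unique; a repeated title would overwrite the earlier entry.
                files.put(toFileName(titleOf(topic)), out.toString());
            }
            return files;
        } catch (ParserConfigurationException | SAXException | IOException | TransformerException e) {
            throw new IllegalArgumentException("Could not split the XML input", e);
        }
    }

    /** Builds a new document whose root is a copy of the topic minus its direct <topic> children. */
    static Document copyWithoutTopicChildren(DocumentBuilder builder, Element topic) {
        Document doc = builder.newDocument();
        // Shallow import: copies the element and its attributes, but no children.
        Element root = (Element) doc.importNode(topic, false);
        doc.appendChild(root);
        for (Node child = topic.getFirstChild(); child != null; child = child.getNextSibling()) {
            boolean nestedTopic = child.getNodeType() == Node.ELEMENT_NODE
                    && "topic".equals(child.getNodeName());
            if (!nestedTopic) {
                root.appendChild(doc.importNode(child, true)); // deep copy of everything else
            }
        }
        return doc;
    }

    /** Text of the first direct <title> child, or the id attribute if there is none. */
    static String titleOf(Element topic) {
        for (Node child = topic.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE && "title".equals(child.getNodeName())) {
                return child.getTextContent().trim();
            }
        }
        return topic.getAttribute("id");
    }

    /** Replaces characters that common file systems reject, then appends ".xml". */
    static String toFileName(String title) {
        return title.replaceAll("[\\\\/:*?\"<>|]", "_") + ".xml";
    }
}
```

To write the result to disk, each map entry could be passed to something like Files.write(outputDir.resolve(name), content.getBytes(StandardCharsets.UTF_8)). Because a nested topic is only skipped while its parent is being copied, and is still visited on its own by the loop, any nesting depth is handled without explicit recursion.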

Answer 2

With XSLT 2.0, which is available for Java through Saxon 9, you can use:

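<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    version="2.0">

    <!-- Write one result document per topic element at any nesting depth.
         Files are named topic1.xml, topic2.xml, ... in document order. -->
    <xsl:template match="/">
        <xsl:for-each select="//topic">
            <xsl:result-document href="topic{position()}.xml">
                <xsl:call-template name="identity"/>
            </xsl:result-document>
        </xsl:for-each>
    </xsl:template>

    <!-- Identity template: copies a node and recurses into its attributes and children.
         It is called by name above so that the current topic itself is copied. -->
    <xsl:template match="@* | node()" name="identity">
        <xsl:copy>
            <xsl:apply-templates select="@* | node()"/>
        </xsl:copy>
    </xsl:template>

    <!-- Nested topics reached through apply-templates are dropped,
         so each file holds only the topic's own content. -->
    <xsl:template match="topic"/>

</xsl:stylesheet>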

Separating elements in an XML file into separate files
