Question

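I'm trying to set up an Apache Camel route that takes a large XML file as input and splits the payload into two different files based on a field condition. That is, if the ID field starts with 1, the element goes to one output file; otherwise it goes to the other. Using Camel is not a must; I have also looked at XSLT and plain Java options, but I feel this should work.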

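I've got the splitting of the actual payload covered, but I'm having trouble making sure the parent nodes, including the header, are also included in each file. Since the files can be large, I want to make sure the payload is streamed. I feel like I've read hundreds of different questions, blog entries, and so on, and almost every case involves loading the whole file into memory, splitting the file evenly into parts, or just using the payload nodes on their own.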

My prototype XML file looks like this:

<root>
    <header>
        <title>Testing</title>
    </header>
    <orders>
        <order>
            <id>11</id>
            <stuff>One</stuff>
        </order>
        <order>
            <id>20</id>
            <stuff>Two</stuff>
        </order>
        <order>
            <id>12</id>
            <stuff>Three</stuff>
        </order>
    </orders> 
</root>

The result should be two files. Condition true (id starts with 1):

<root>
    <header>
        <title>Testing</title>
    </header>
    <orders>
        <order>
            <id>11</id>
            <stuff>One</stuff>
        </order>
        <order>
            <id>12</id>
            <stuff>Three</stuff>
        </order>
    </orders> 
</root>

Condition false:

<root>
    <header>
        <title>Testing</title>
    </header>
    <orders>
        <order>
            <id>20</id>
            <stuff>Two</stuff>
        </order>
    </orders> 
</root>

My prototype route:

from("file:" + inputFolder)
.log("Processing file ${headers.CamelFileName}")
.split()
    .tokenizeXML("order", "*") // Includes parent in every node
    .streaming()
    .choice()
        .when(body().contains("id>1"))
            .to("direct:ones")
            .stop()
        .otherwise()
            .to("direct:others")
            .stop()
    .end()
.end();

from("direct:ones")
//.aggregate(header("ones"), new StringAggregator()) // missing end condition
.to("file:" + outputFolder + "?fileName=ones-${in.header.CamelFileName}&fileExist=Append");

from("direct:others")
//.aggregate(header("others"), new StringAggregator()) // missing end condition
.to("file:" + outputFolder + "?fileName=others-${in.header.CamelFileName}&fileExist=Append");

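This works as intended, except for adding the parent tags (header and footer, if you will) to each node. Using only the node name in tokenizeXML returns just the node itself, but I can't figure out how to add the header and footer. Preferably I would stream the parent tags into header and footer properties and add them before and after the split.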

How can I do this? Would I need to tokenize the parent tags first, and would that mean streaming the file twice?

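As a final note, you may notice the aggregation at the end. I don't want to aggregate every node before writing to the file, since that defeats the purpose of streaming and of keeping the whole file out of memory. However, I figured I might gain some performance by aggregating a number of nodes before each write, to lessen the cost of hitting the drive for every node. I'm not sure whether this makes sense.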

Answer 1

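I was unable to get this working with Camel. Or rather, once I was using plain Java to extract the header, I already had everything I needed to go on and do the split, and switching back to Camel seemed cumbersome. There are most likely ways to improve on this, but here is my solution for splitting the XML payload.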

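Switching between the two types of output stream isn't pretty, but it simplifies the use of everything else. Also worth noting: I chose equalsIgnoreCase to check the tag names, even though XML is normally case sensitive. For me, it reduces the risk of errors. Finally, make sure your regex matches the entire string using wildcards, as with normal String regex matching.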
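To illustrate the note about whole-string matching: `String.matches` only succeeds when the pattern covers the entire input, and `.` does not match line breaks unless DOTALL is enabled. The sketch below uses a hypothetical class name and a hand-written payload string that assumes the input is indented as in the question's sample, so the extracted tag contains newlines.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; PAYLOAD mimics an indented <order> element read as a String.
class WholeStringRegexDemo {

    static final String PAYLOAD =
            "<order>\n    <id>11</id>\n    <stuff>One</stuff>\n</order>";

    public static void main(String[] args) {
        // matches() must cover the whole input, so a bare fragment fails.
        System.out.println(PAYLOAD.matches("<id>1"));            // false

        // '.' skips line terminators by default, so wildcards alone fail on multi-line input.
        System.out.println(PAYLOAD.matches(".*<id>1.*"));        // false

        // (?s) enables DOTALL, letting '.' match newlines as well.
        System.out.println(PAYLOAD.matches("(?s).*<id>1.*"));    // true

        // find() searches for the fragment anywhere and needs no wildcards.
        System.out.println(Pattern.compile("<id>1").matcher(PAYLOAD).find()); // true
    }
}
```

If the payload elements span several lines, a condition such as `(?s).*<id>1.*` is probably what is needed for the split regex.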

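import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.StringWriter;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.EndElement;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Both methods belong inside a class that also provides a "logger" field.

/**
 * Splits an XML file's payload into two new files based on a regex condition. The payload is a specific XML tag in the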
 * input file that is repeated a number of times. All tags before and after the payload are added to both files in order
 * to keep the same structure.
 * 
 * The content of each payload tag is compared to the regex condition and if true, it is added to the primary output file.
 * Otherwise it is added to the secondary output file. The payload can be empty and an empty payload tag will be added to
 * the secondary output file. Note that the output will not be an unaltered copy of the input as self-closing XML tags are
 * altered to corresponding opening and closing tags.
 * 
 * Data is streamed from the input file to the output files, keeping memory usage small even with large files.
 * 
 * @param inputFilename Path and filename for the input XML file
 * @param outputFilenamePrimary Path and filename for the primary output file
 * @param outputFilenameSecondary Path and filename for the secondary output file
 * @param payloadTag XML tag name of the payload
 * @param payloadParentTag XML tag name of the payload's direct parent
 * @param splitRegex The regex split condition used on the payload content
 * @throws Exception On invalid filenames, missing input, incorrect XML structure, etc.
 */
public static void splitXMLPayload(String inputFilename, String outputFilenamePrimary, String outputFilenameSecondary, String payloadTag, String payloadParentTag, String splitRegex) throws Exception {

    XMLInputFactory xmlInputFactory = XMLInputFactory.newInstance();
    XMLOutputFactory xmlOutputFactory = XMLOutputFactory.newInstance();
    XMLEventReader xmlEventReader = null;
    FileInputStream fileInputStream = null;
    FileWriter fileWriterPrimary = null;
    FileWriter fileWriterSecondary = null;
    XMLEventWriter xmlEventWriterSplitPrimary = null;
    XMLEventWriter xmlEventWriterSplitSecondary = null;

    try {
        fileInputStream = new FileInputStream(inputFilename);
        xmlEventReader = xmlInputFactory.createXMLEventReader(fileInputStream);

        fileWriterPrimary = new FileWriter(outputFilenamePrimary);
        fileWriterSecondary = new FileWriter(outputFilenameSecondary);
        xmlEventWriterSplitPrimary = xmlOutputFactory.createXMLEventWriter(fileWriterPrimary);
        xmlEventWriterSplitSecondary = xmlOutputFactory.createXMLEventWriter(fileWriterSecondary);

        boolean isStart = true;
        boolean isEnd = false;
        boolean lastSplitIsPrimary = true;

        while (xmlEventReader.hasNext()) {
            XMLEvent xmlEvent = xmlEventReader.nextEvent();

            // Check for start of payload element
            if (!isEnd && xmlEvent.isStartElement()) {
                StartElement startElement = xmlEvent.asStartElement();
                if (startElement.getName().getLocalPart().equalsIgnoreCase(payloadTag)) {
                    if (isStart) {
                        isStart = false;
                        // Flush the event writers as we'll use the file writers for the payload
                        xmlEventWriterSplitPrimary.flush();
                        xmlEventWriterSplitSecondary.flush();
                    }

                    String order = getTagAsString(xmlEventReader, xmlEvent, payloadTag, xmlOutputFactory);
                    if (order.matches(splitRegex)) {
                        lastSplitIsPrimary = true;
                        fileWriterPrimary.write(order);
                    } else {
                        lastSplitIsPrimary = false;
                        fileWriterSecondary.write(order);
                    }
                }
            }
            // Check for end of parent tag
            else if (!isStart && !isEnd && xmlEvent.isEndElement()) {
                EndElement endElement = xmlEvent.asEndElement();
                if (endElement.getName().getLocalPart().equalsIgnoreCase(payloadParentTag)) {
                    isEnd = true;
                }
            }
            // Is neither start or end and we're handling payload (most often white space)
            else if (!isStart && !isEnd) {
                // Add to last split handled
                if (lastSplitIsPrimary) {
                    xmlEventWriterSplitPrimary.add(xmlEvent);
                    xmlEventWriterSplitPrimary.flush();
                } else {
                    xmlEventWriterSplitSecondary.add(xmlEvent);
                    xmlEventWriterSplitSecondary.flush();
                }
            }

            // Start and end is added to both files
            if (isStart || isEnd) {
                xmlEventWriterSplitPrimary.add(xmlEvent);
                xmlEventWriterSplitSecondary.add(xmlEvent);
            }
        }

    } catch (Exception e) {
        logger.error("Error in XML split", e);
        throw e;
    } finally {
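        // Close the reader, the input stream and the writers; any of them may be null if setup failed
        try {
            if (xmlEventReader != null) xmlEventReader.close();
        } catch (XMLStreamException e) {
            // ignore
        }
        try {
            if (fileInputStream != null) fileInputStream.close();
        } catch (IOException e) {
            // ignore
        }
        try {
            if (xmlEventWriterSplitPrimary != null) xmlEventWriterSplitPrimary.close();
        } catch (XMLStreamException e) {
            // ignore
        }
        try {
            if (xmlEventWriterSplitSecondary != null) xmlEventWriterSplitSecondary.close();
        } catch (XMLStreamException e) {
            // ignore
        }
        try {
            if (fileWriterPrimary != null) fileWriterPrimary.close();
        } catch (IOException e) {
            // ignore
        }
        try {
            if (fileWriterSecondary != null) fileWriterSecondary.close();
        } catch (IOException e) {
            // ignore
        }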
    }
}

/**
 * Loops through the events in the {@code XMLEventReader} until the specific XML end tag is found and returns everything
 * contained within the XML tag as a String.
 * 
 * Data is streamed from the {@code XMLEventReader}; however, the String can be large depending on the number of children
 * in the XML tag.
 * 
 * @param xmlEventReader The already active reader. The starting tag event is assumed to have already been read
 * @param startEvent The starting XML tag event already read from the {@code XMLEventReader}
 * @param tag The XML tag name used to find the starting XML tag
 * @param xmlOutputFactory Convenience include to avoid creating another factory
 * @return String containing everything between the starting and ending XML tag, the tags themselves included
 * @throws Exception On incorrect XML structure
 */
private static String getTagAsString(XMLEventReader xmlEventReader, XMLEvent startEvent, String tag, XMLOutputFactory xmlOutputFactory) throws Exception {
    StringWriter stringWriter = new StringWriter();
    XMLEventWriter xmlEventWriter = xmlOutputFactory.createXMLEventWriter(stringWriter);

    // Add the start tag
    xmlEventWriter.add(startEvent);

    // Add until end tag
    while (xmlEventReader.hasNext()) {
        XMLEvent xmlEvent = xmlEventReader.nextEvent();

        // End tag found
        if (xmlEvent.isEndElement() && xmlEvent.asEndElement().getName().getLocalPart().equalsIgnoreCase(tag)) {
            xmlEventWriter.add(xmlEvent);
            xmlEventWriter.close();
            stringWriter.close();

            return stringWriter.toString();
        } else {
            xmlEventWriter.add(xmlEvent);
        }
    }

    xmlEventWriter.close();
    stringWriter.close();
    throw new Exception("Invalid XML, no closing tag for <" + tag + "> found!");
}
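
The switching between event writers and file writers mentioned above could probably be avoided by buffering the events of one payload element at a time and replaying them into whichever XMLEventWriter the condition selects. The following is only a minimal sketch of that idea, not the method above: the class name, the `idTag` and `prefix` parameters, and the use of in-memory strings instead of files are assumptions made to keep it short, and it assumes payload elements are not nested inside each other.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper illustrating per-element buffering; not part of the answer's code.
class BufferedPayloadSplit {

    /** Returns {primary, secondary}; payloads whose idTag text starts with prefix go to primary. */
    static String[] split(String xml, String payloadTag, String idTag, String prefix)
            throws XMLStreamException {
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        XMLOutputFactory outputFactory = XMLOutputFactory.newInstance();
        StringWriter primaryOut = new StringWriter();
        StringWriter secondaryOut = new StringWriter();
        XMLEventWriter primary = outputFactory.createXMLEventWriter(primaryOut);
        XMLEventWriter secondary = outputFactory.createXMLEventWriter(secondaryOut);

        List<XMLEvent> payload = null;          // events of the payload element being read, if any
        StringBuilder id = new StringBuilder(); // text of the id element inside that payload
        boolean inId = false;

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();

            // A payload start tag opens a new buffer.
            if (payload == null && event.isStartElement()
                    && event.asStartElement().getName().getLocalPart().equals(payloadTag)) {
                payload = new ArrayList<>();
                id.setLength(0);
            }

            // Everything outside a payload element (header, footer, whitespace) goes to both outputs.
            if (payload == null) {
                primary.add(event);
                secondary.add(event);
                continue;
            }

            payload.add(event);
            if (event.isStartElement()
                    && event.asStartElement().getName().getLocalPart().equals(idTag)) {
                inId = true;
            } else if (event.isEndElement()
                    && event.asEndElement().getName().getLocalPart().equals(idTag)) {
                inId = false;
            } else if (inId && event.isCharacters()) {
                id.append(event.asCharacters().getData());
            }

            // At the payload end tag, replay the buffered events into the chosen writer.
            if (event.isEndElement()
                    && event.asEndElement().getName().getLocalPart().equals(payloadTag)) {
                XMLEventWriter target =
                        id.toString().trim().startsWith(prefix) ? primary : secondary;
                for (XMLEvent buffered : payload) {
                    target.add(buffered);
                }
                payload = null;
            }
        }

        primary.flush();
        secondary.flush();
        primary.close();
        secondary.close();
        return new String[] { primaryOut.toString(), secondaryOut.toString() };
    }
}
```

Only one payload element is held in memory at a time, so replacing the StringReader and StringWriters with file-based streams should keep the memory footprint small for large inputs. Whitespace between payload elements is copied to both outputs in this sketch.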
