Question

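I want to take a well-formed XML file, about half a gigabyte in size, and create from it another XML file that contains only selected elements of the original.

1) How can I do that?

2) Can it be done with a DOM parser? What is the size limit of a DOM parser?

Thanks!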

Answer 1

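If you have a very large source XML (like your 0.5 GB file) and want to extract information from it, possibly creating a new XML, you might consider using an event-based parser, which does not need to load the whole XML into memory. The simplest of these implementations is the SAX parser. It requires you to write an event listener that captures events such as document-start, element-start and element-end, in which you can inspect the data you are reading (element names, attributes, etc.) and decide whether to ignore it or do something with it.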

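Search for a SAX tutorial using JAXP and you should find several examples. Another strategy you might want to consider, depending on what you want to do, is StAX.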
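For comparison, a StAX version of such a filter might look roughly like the sketch below. It is only an illustration: the class name StaxFilterSketch and its filter method are made up here, and the Movie and Director element names are assumed from the sample data used later in this answer. It keeps at most one Movie in memory at a time and copies it to the output only if its Director text contains the search string (case-sensitive in this sketch).

```java
import java.io.Reader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper, not part of the original answer
class StaxFilterSketch {

    // Copies the document, keeping only Movie elements whose Director contains `director`
    static String filter(Reader source, String director) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(source);
        StringWriter out = new StringWriter();
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);

        List<XMLEvent> movie = null;          // buffered events of the current Movie, or null
        StringBuilder directorText = new StringBuilder();
        boolean inDirector = false;
        boolean keep = false;

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();

            if (event.isStartElement()) {
                String name = event.asStartElement().getName().getLocalPart();
                if (name.equals("Movie")) {
                    movie = new ArrayList<>();
                    keep = false;
                } else if (name.equals("Director") && movie != null) {
                    inDirector = true;
                    directorText.setLength(0);
                }
            }

            if (movie == null) {
                writer.add(event);            // outside a Movie: copy the event unchanged
                continue;
            }

            movie.add(event);
            if (inDirector && event.isCharacters()) {
                // text may arrive in several events, so accumulate it
                directorText.append(event.asCharacters().getData());
            }

            if (event.isEndElement()) {
                String name = event.asEndElement().getName().getLocalPart();
                if (name.equals("Director")) {
                    inDirector = false;
                    keep = directorText.toString().contains(director);
                } else if (name.equals("Movie")) {
                    if (keep) {
                        for (XMLEvent buffered : movie) {
                            writer.add(buffered);
                        }
                    }
                    movie = null;             // release the buffer before the next Movie
                }
            }
        }
        writer.close();
        return out.toString();
    }
}
```

For a real 0.5 GB file you would pass a FileReader (or create the reader from an InputStream) and create the writer on a FileWriter instead of collecting the result in a String.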

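Here is a simple example that uses SAX to read data from an XML file and extract some information based on search criteria. It is a very simple example I use to teach SAX processing, and I think it may help you understand how it works. The search criterion is hard-wired and consists of a film director's name, which is used to search a huge XML file of IMDB data and generate a selection of movies.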

Sample XML source ("source.xml", a ~300 MB file)

<Movies>
    ...
    <Movie>
        <Imdb>tt1527186</Imdb>
        <Title>Melancholia</Title>
        <Director>Lars von Trier</Director>
        <Year>2011</Year>
        <Duration>136</Duration>
    </Movie>
    <Movie>
        <Imdb>tt0060390</Imdb>
        <Title>Fahrenheit 451</Title>
        <Director>François Truffaut</Director>
        <Year>1966</Year>
        <Duration>112</Duration>
    </Movie>
    <Movie>
        <Imdb>tt0062622</Imdb>
        <Title>2001: A Space Odyssey</Title>
        <Director>Stanley Kubrick</Director>
        <Year>1968</Year>
        <Duration>160</Duration>
    </Movie>
    ...
</Movies>

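Below is an example of the event handler. It selects Movie elements by matching a string. I extended DefaultHandler and implemented startElement() (called when a start tag is found), characters() (called when a block of characters is read), endElement() (called when an end tag is found) and endDocument() (called once when the document is finished). Since the data that is read is not kept in memory, you have to save the data you are interested in yourself. I used some boolean flags and instance variables to hold the current tag, the current data, and so on.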

import java.util.ArrayList;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class ExtractMovieSaxHandler extends DefaultHandler {

    // These are the search parameters which select the subtrees
    // (they receive their values when we set up the parser)
    private String tagToMatch;       // name of the child element to test
    private String tagContents;      // text to look for in that element
    private boolean strict = false;  // if true, the match must be exact

    /**
     * Sets criteria to select and copy Movie elements from source XML.
     *
     * @param tagToMatch Name of a child element that contains text only
     * @param tagContents Text contents of the tag
     * @param strict If true, match must be exact
     */
    public void setSearchCriteria(String tagToMatch, String tagContents, boolean strict) {
        this.tagToMatch = tagToMatch;
        this.tagContents = tagContents;
        this.strict = strict;
    }

    // These are the temporary values we store as we parse the file
    private String currentElement;         // name of the element being read, or null
    private StringBuilder contents = null; // if not null we are inside a Movie tag
    private String currentData;            // text of the element being read
    private final List<String> result = new ArrayList<>(); // store resulting nodes here
    private boolean skip = false;          // true once the current Movie is known not to match

...

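These methods implement ContentHandler. The first one is called when an element's start tag is found. We save the name of the element (a child of Movie) in a variable, because it might be the one we use in the search: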

...

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {

        // Store the name of the element that has just started
        currentElement = qName;

        // Forget the text of the previous element, so that an empty element
        // is not written out with stale data
        currentData = "";

        // If this is a Movie tag, start collecting its contents because we might need them
        if (qName.equals("Movie")) {
            contents = new StringBuilder();
        }

    }
...

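This method is called each time a block of characters is read. We check whether the characters occur inside the element we are interested in. If they do, we compare the contents with the search text and keep the Movie only if they match.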

...
    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {

        // If we already know we don't need this Movie, or we are between elements, skip the data
        if (skip || currentElement == null) {
            return;
        }

        // Keep the text of the current element.
        // Note: a SAX parser may deliver the text of one element in several calls;
        // this simple example assumes each element's text arrives in a single call.
        currentData = new String(ch, start, length);

        // If we are inside the tag we want to search, test its contents
        if (currentElement.equals(tagToMatch)) {
            boolean matches;

            if (strict) {
                matches = currentData.equals(tagContents); // exact match
            } else {
                // case-insensitive substring match
                matches = currentData.toLowerCase().contains(tagContents.toLowerCase());
            }

            if (!matches) {
                skip = true; // discard the whole Movie
            }
        }

    }
...

This method is called when an end tag is found. If we want the element, we can now append it to the document we are building in memory.

...
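    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {

        // Rebuild the XML if it's a node we didn't skip
        if (qName.equals("Movie")) {
            if (!skip) {
                result.add(contents.insert(0, "<Movie>").append("</Movie>").toString());
            }

            // reset the variables so we can check the next node
            contents = null;
            skip = false;
        } else if (contents != null && !skip) {
            // A child of Movie has ended: write it out again as text
            contents.append("<").append(qName).append(">")
                    .append(escape(currentData))
                    .append("</").append(qName).append(">");
        }

        currentElement = null;
    }

    // The parser hands us unescaped text, so the characters that are
    // not allowed in XML text content must be escaped again on output
    private static String escape(String text) {
        if (text == null) {
            return "";
        }
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }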
...

Finally, this method is called when the document ends. I also use it to print the result.

...
    @Override
    public void endDocument() throws SAXException {
        // Wrap the selected Movie elements in a new root element
        StringBuilder resultFile = new StringBuilder();
        resultFile.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
        resultFile.append("<Movies>");
        for (String childNode : result) {
            resultFile.append(childNode);
        }
        resultFile.append("</Movies>");

        System.out.println("=== Resulting XML containing Movies where " + tagToMatch + " matches " + tagContents + " ===");
        System.out.println(resultFile);
    }

}
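
One caveat about characters(): the SAX specification allows a parser to deliver the text of a single element in several consecutive calls (this often happens around entity references such as &amp;amp; or at buffer boundaries). A more robust variant collects the chunks in a StringBuilder and only looks at the complete text in endElement(). The following is a minimal, separate sketch of that idea, not a drop-in replacement for the handler above; the class name TextAccumulatingHandler and the directorsIn helper are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that collects the complete text of every Director element
class TextAccumulatingHandler extends DefaultHandler {

    private final StringBuilder text = new StringBuilder();
    final List<String> directors = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0); // a new element starts: discard text collected so far
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // may be called several times per element
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Only here is the element's text guaranteed to be complete
        if (qName.equals("Director")) {
            directors.add(text.toString().trim());
        }
        text.setLength(0);
    }

    // Convenience method for trying the handler on a small in-memory document
    static List<String> directorsIn(String xml) throws Exception {
        TextAccumulatingHandler handler = new TextAccumulatingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.directors;
    }
}
```

The same pattern could be applied to the search above by moving the comparison with the search text from characters() to endElement().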

Here is a small Java application that loads the file and uses the event handler to extract the data.

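import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

public class SAXReaderExample {

    public static final String PATH = "src/main/resources"; // this is where I put the XML file

    public static void main(String[] args) throws ParserConfigurationException, SAXException, IOException {

        // Obtain XML Reader
        SAXParserFactory spf = SAXParserFactory.newInstance();
        SAXParser sp = spf.newSAXParser();
        XMLReader reader = sp.getXMLReader();

        // Instantiate SAX handler
        ExtractMovieSaxHandler handler = new ExtractMovieSaxHandler();

        // Set search criteria: Director contains "Kubrick" (not strict)
        handler.setSearchCriteria("Director", "Kubrick", false);

        // Register handler with XML reader
        reader.setContentHandler(handler);

        // Parse the XML; try-with-resources closes the file when parsing ends
        try (InputStream in = new FileInputStream(new File(PATH, "source.xml"))) {
            reader.parse(new InputSource(in));
        }
    }
}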

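Here is the file produced after processing (indented here for readability; the handler itself writes the elements without whitespace between them):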

<?xml version="1.0" encoding="UTF-8"?>
<Movies>
    <Movie>
        <Imdb>tt0062622</Imdb>
        <Title>2001: A Space Odyssey</Title>
        <Director>Stanley Kubrick</Director>
        <Year>1968</Year>
        <Duration>160</Duration>
    </Movie>
    <Movie>
        <Imdb>tt0066921</Imdb>
        <Title>A Clockwork Orange</Title>
        <Director>Stanley Kubrick</Director>
        <Year>1972</Year>
        <Duration>136</Duration>
    </Movie>
    <Movie>
        <Imdb>tt0081505</Imdb>
        <Title>The Shining</Title>
        <Director>Stanley Kubrick</Director>
        <Year>1980</Year>
        <Duration>144</Duration>
    </Movie>
    ...
</Movies>

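Your scenario may differ, but this example shows a general solution that you can adapt to your problem. You can find more information in tutorials about SAX and JAXP.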

Answer 2

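500 MB is well within the range of what you can achieve using XSLT. It depends on how much effort you want to spend on developing the optimal solution: that is, which is more expensive, your time or the machine's time?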
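
As a rough illustration of the XSLT route, the sketch below runs a small XSLT 1.0 stylesheet through the standard javax.xml.transform API. The stylesheet, the class name XsltFilterSketch and the director parameter are assumptions made up for this example, using the Movies/Movie/Director structure from Answer 1. Keep in mind that a typical XSLT 1.0 processor, including the one bundled with the JDK, builds the whole input document in memory, so the JVM heap has to be sized accordingly for a 500 MB file.

```java
import java.io.StringReader;
import javax.xml.transform.Result;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper, not part of the original answer
class XsltFilterSketch {

    // Copies only the Movie elements whose Director contains the $director parameter
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' encoding='UTF-8'/>"
        + "<xsl:param name='director'/>"
        + "<xsl:template match='/Movies'>"
        + "<Movies>"
        + "<xsl:copy-of select='Movie[contains(Director, $director)]'/>"
        + "</Movies>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    static void filter(Source in, Result out, String director) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        transformer.setParameter("director", director); // value of <xsl:param name='director'/>
        transformer.transform(in, out);
    }
}
```

To process files, the method could be called as XsltFilterSketch.filter(new StreamSource(new File("source.xml")), new StreamResult(new File("selected.xml")), "Kubrick"), where both file names are placeholders.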
