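Splitting XML into smaller XML files of a specified size

Date: 2014-02-10 22:06:13

Tags: java xml

I'm very new to XML, and the bad news is that I have XML with the following structure: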

<record>
   <record_id>200</record_id>
   <record_rows>
        <record_row>some text</record_row>
        .................................
   </record_rows>
</record>

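The number of record rows differs from record to record, so record sizes vary widely. My task is to split the file (more than 1 GB) into separate XML files of a specified size. Which parser is best for this? Also, I suppose I need some record-selection strategy to get close to the target size (I can't come up with one at the moment, given the size of the input file and the unpredictable size of the next record).

You are my only hope, my friends. How would you do it?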

1 answer:

Answer 0 (score: 1)

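Assuming no single record row is larger than the desired size of one file, you can use a SAX parser to read the file sequentially and count the characters read, storing the data read so far in a buffer. When the character count reaches a value close to your size limit, write a new file containing only the rows read so far, reset the buffer and the character count, and continue reading the next batch until the limit is reached again, and so on. In the end you will have a set of files of roughly the same size (except the last one, which may be smaller) that together contain the same data.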
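The counting strategy described above can be sketched on its own, without any XML involved. This is only an illustration: the class name GreedySplitSketch, the method split, and the sample row sizes are made up for the example. A chunk is closed as soon as its running total reaches the limit, so it can overshoot by part of one row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class GreedySplitSketch {

    // Groups row sizes into consecutive chunks. A chunk is closed as soon as
    // its running total reaches maxSize, so it may exceed maxSize slightly.
    public static List<List<Integer>> split(List<Integer> rowSizes, long maxSize) {
        List<List<Integer>> chunks = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long count = 0;
        for (int size : rowSizes) {
            current.add(size);
            count += size;
            if (count >= maxSize) { // limit reached: close this chunk
                chunks.add(current);
                current = new ArrayList<>();
                count = 0;
            }
        }
        if (!current.isEmpty()) { // leftover rows form the last, smaller chunk
            chunks.add(current);
        }
        return chunks;
    }

    public static void main(String[] args) {
        // hypothetical row sizes in characters
        List<Integer> sizes = Arrays.asList(400, 300, 500, 200, 900, 100);
        System.out.println(split(sizes, 1000)); // prints [[400, 300, 500], [200, 900], [100]]
    }
}
```

If the files must never exceed the limit, the check would have to happen before adding a row rather than after it.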

To use the SAX parser, you need an executable class containing the following code:

import java.io.*;
import javax.xml.parsers.*;
import org.xml.sax.*;

public class SAXReader {

    public static final String PATH = "src/main/resources";

    public static void main(String[] args) throws ParserConfigurationException, SAXException, IOException {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        SAXParser sp = spf.newSAXParser();
        XMLReader reader = sp.getXMLReader();
        reader.setContentHandler(new DataSaxHandler()); // need to implement this file
        reader.parse(new InputSource(new FileInputStream(new File(PATH, "data.xml"))));
    }
}

This assumes your XML file is stored at src/main/resources/data.xml (relative to where you run the application). You will probably want to change that.

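If the split files are to be well-formed XML, each of them also needs a root element, and should probably keep information such as record_id so that you know which record the rows came from. I added an attribute part containing a sequence number that orders the file fragments. The resulting files look like this: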

data_part_1.xml

<record part='1'><record_id>200</record_id><record_rows><record_row>...</record_row><record_row>...</record_row> ... <record_row>...</record_row></record_rows></record>

data_part_2.xml

<record part='2'><record_id>200</record_id><record_rows><record_row>...</record_row><record_row>...</record_row> ... <record_row>...</record_row></record_rows></record>

...

data_part_n.xml

<record part='n'><record_id>200</record_id><record_rows><record_row>...</record_row><record_row>...</record_row><record_row>...</record_row><record_row>...</record_row></record_rows></record>

where 'n' is the number of files created.
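One way to check that a generated fragment really is well-formed is to parse it again. The sketch below is only an illustration: the class name FragmentCheckSketch and the sample fragment are made up, and it uses a DOM parser on the assumption that each fragment is small enough to fit in memory.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class FragmentCheckSketch {

    // Parses one fragment and returns the number of record_row elements in it.
    // Throws an exception if the fragment is not well-formed XML.
    public static int countRows(String fragmentXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(fragmentXml)));
        return doc.getElementsByTagName("record_row").getLength();
    }

    public static void main(String[] args) throws Exception {
        // hypothetical fragment in the format shown above
        String fragment = "<record part='1'><record_id>200</record_id><record_rows>"
                + "<record_row>a</record_row><record_row>b</record_row>"
                + "</record_rows></record>";
        System.out.println(countRows(fragment)); // prints 2
    }
}
```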

A SAX ContentHandler implementation that achieves this result is shown below. You may want to change the DIRECTORY and MAX_SIZE constants:

import java.io.*;
import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;

class DataSaxHandler extends DefaultHandler {

    // Change this to the directory where the files will be stored
    public static final String DIRECTORY = "target/results"; 

    // Change this to the approximate size of the resulting files (in characters)
    public static final long MAX_SIZE = 1024;


    public static final long TAG_CHAR_SIZE = 5; //"<></>"

    // counts number of files created
    private int fileCount = 0;

    // counts characters to decide where to split file
    private long charCount = 0;
    // data line buffer (is reset when the file is split)
    private StringBuilder recordRowDataLines = new StringBuilder();

    // temporary variables used for the parser events
    private String currentElement = null;
    private String currentRecordId = null;
    private String currentRecordRowData = null;

    @Override
    public void startDocument() throws SAXException {
        File dir = new File(DIRECTORY);
        if (!dir.exists()) {
            dir.mkdirs(); // also creates missing parent directories such as "target"
        }
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {
        currentElement = qName;
        // start with empty text; characters() may deliver one text node in several chunks
        if (qName.equals("record_id")) {
            currentRecordId = "";
        } else if (qName.equals("record_row")) {
            currentRecordRowData = "";
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (qName.equals("record_rows")) { // no more rows in this record - save the remaining rows here!
            try {
                saveFragment();
            } catch (IOException ex) {
                throw new SAXException(ex);
            }
        }
        if (qName.equals("record_row")) { // one row finished - save in buffer & calculate size so far
            charCount += tagSize("record_row");
            recordRowDataLines.append("<record_row>")
                              .append(escape(currentRecordRowData))
                              .append("</record_row>");
            if (charCount >= MAX_SIZE) { // if max size was reached, save what was read so far in a new file
                try {
                    saveFragment();
                } catch (IOException ex) {
                    throw new SAXException(ex);
                }
            }
        }
        currentElement = null;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (currentElement == null) {
            return;
        }
        // append instead of overwrite: the parser may split one text node across several calls
        if (currentElement.equals("record_id")) {
            currentRecordId += new String(ch, start, length);
        }
        if (currentElement.equals("record_row")) {
            currentRecordRowData += new String(ch, start, length);
            charCount += length; // storing size so far
        }
    }

    // the parser delivers unescaped text, so markup characters must be escaped again before writing
    private static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public long tagSize(String tagName) {
        return TAG_CHAR_SIZE + tagName.length() * 2; // size of text + tags
    }

    /**
     * Saves a new file containing approximately MAX_SIZE in chars 
     */
    public void saveFragment() throws IOException {
        ++fileCount;
        StringBuilder fileContent = new StringBuilder();
        fileContent.append("<record part='")
                   .append(fileCount)
                   .append("'><record_id>")
                   .append(currentRecordId)
                   .append("</record_id>")
                   .append("<record_rows>")
                   .append(recordRowDataLines)
                   .append("</record_rows></record>");
        File fragment = new File(DIRECTORY, "data_part_" + fileCount + ".xml");
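        // write as UTF-8 (what an XML parser assumes when there is no XML declaration);
        // try-with-resources closes the writer even if writing fails
        try (Writer out = new OutputStreamWriter(new FileOutputStream(fragment), java.nio.charset.StandardCharsets.UTF_8)) {
            out.write(fileContent.toString());
        }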

        // reset fragment data - record buffer and char count
        recordRowDataLines = new StringBuilder();
        charCount = 0;
    }

}
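Note that MAX_SIZE counts characters, while the size of a file on disk is measured in bytes and depends on the encoding. If the target size has to be met in bytes, the counter could add encoded lengths instead. A minimal sketch, assuming the fragments are written as UTF-8 (the class name ByteSizeSketch and the method utf8Size are made up for the example):

```java
import java.nio.charset.StandardCharsets;

public class ByteSizeSketch {

    // Returns the number of bytes the text occupies when encoded as UTF-8.
    public static int utf8Size(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Size("hello"));      // prints 5: ASCII characters take 1 byte each
        System.out.println(utf8Size("h\u00e9llo")); // prints 6: the accented e takes 2 bytes
    }
}
```

For mostly ASCII data the two counts are nearly identical, so the character count is usually a good enough approximation.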