Question

I am trying to create a Java program that splits a selected XML file.

Sample of the XML file data:

<EmployeeDetails>
<Employee>
<FirstName>Ben</FirstName>
</Employee>
<Employee>
<FirstName>George</FirstName>
</Employee>
<Employee>
<FirstName>Cling</FirstName>
</Employee>
</EmployeeDetails>

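And so on. I have this 250 MB XML file, and it is always a pain to open it in an external program and split it by hand so that other people can read it (not every laptop/desktop can open a file this large). So I decided to write a Java program with these features:

- Select the XML file (done)
- Decide how to split the file based on the number of tags. For example, the current file has 100k tags, so I would ask the user how many tags each output file should hold (e.g. 10k per file)
- Split the file (done)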

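I would just like to ask for help with the second task. I have spent 3-4 days looking into how to do this, or whether it is doable at all (in my opinion it certainly is).

Any reply would be much appreciated.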

Cheers, Green.

Answer 1

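Assuming a flat structure in which the document's root element R has a large number of child elements named X, the following XSLT 2.0 transformation splits the file at every Nth X element, so each output file holds up to N of them.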

<t:transform xmlns:t="http://www.w3.org/1999/XSL/Transform"
  version="2.0">
  <t:param name="N" select="100"/>
  <t:template match="/*">
    <t:for-each-group select="X" 
                      group-adjacent="(position()-1) idiv $N">
      <t:result-document href="{position()}.xml">
        <R>
          <t:copy-of select="current-group()"/>
        </R>
      </t:result-document>
    </t:for-each-group>
  </t:template>
</t:transform>
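
To make the grouping key concrete, here is a small Java sketch of the arithmetic behind (position()-1) idiv $N. The class and method names (GroupKeyDemo, groupKey) are made up for illustration and are not part of the answer. Consecutive X elements that share the same key end up in the same output file.

```java
// Illustrative only: mirrors the XPath expression (position() - 1) idiv $N
// that the stylesheet uses as its group-adjacent key.
class GroupKeyDemo {

    // position is 1-based, as in XPath. For non-negative operands, Java's
    // integer division gives the same result as XPath's idiv.
    static int groupKey(int position, int n) {
        return (position - 1) / n;
    }

    public static void main(String[] args) {
        int n = 100; // same default as the stylesheet parameter N
        for (int position : new int[] {1, 100, 101, 200, 201}) {
            System.out.println("X #" + position + " -> group key " + groupKey(position, n));
        }
    }
}
```

With N = 100, elements 1 to 100 get key 0, elements 101 to 200 get key 1, and so on. Inside t:for-each-group, position() is the 1-based index of the current group, so the group with key 0 is written to 1.xml, the group with key 1 to 2.xml, etc.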

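If you want to run the stylesheet in streaming mode (without building the source tree in memory), then (a) add <xsl:mode streamable="yes"/> (written <t:mode streamable="yes"/> with the t prefix used above), and (b) run it with an XSLT 3.0 processor (Saxon-EE or Exselt).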
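
If no XSLT 3.0 processor is at hand, a similar streaming split can be sketched in plain Java with the JDK's StAX API (javax.xml.stream). Note that this is a different technique from the XSLT above, not something the answer proposes. The names StaxSplitter, split, prefix and perFile are hypothetical. The sketch assumes the same flat structure (all records are direct children of the root), drops whitespace between records, and writes every part as UTF-8.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: streams through the source once and copies the
// children of the root element into part files of at most perFile children.
class StaxSplitter {

    /** Splits source into outDir/prefix-1.xml, prefix-2.xml, ...; returns the number of parts. */
    static int split(Path source, Path outDir, String prefix, int perFile)
            throws IOException, XMLStreamException {
        if (perFile <= 0) {
            throw new IllegalArgumentException("perFile must be positive");
        }
        XMLOutputFactory outFactory = XMLOutputFactory.newInstance();
        XMLEventFactory events = XMLEventFactory.newInstance();
        int parts = 0;
        try (InputStream in = Files.newInputStream(source)) {
            XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
            StartElement root = null;   // remembered so every part gets the same wrapper
            XMLEventWriter writer = null;
            OutputStream out = null;
            int depth = 0;              // 1 = inside root, 2 = inside a record
            int inPart = 0;             // records written to the current part
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()) {
                    depth++;
                    if (depth == 1) {
                        root = event.asStartElement();
                        continue;
                    }
                    if (depth == 2 && writer == null) {
                        // First record of a new part: open the file, write the wrapper.
                        parts++;
                        out = Files.newOutputStream(outDir.resolve(prefix + "-" + parts + ".xml"));
                        writer = outFactory.createXMLEventWriter(out, "UTF-8");
                        writer.add(events.createStartDocument("UTF-8"));
                        writer.add(root);
                    }
                }
                if (depth >= 2 && writer != null) {
                    writer.add(event);  // copy everything that belongs to a record
                }
                if (event.isEndElement()) {
                    depth--;
                    if (depth == 1 && ++inPart == perFile) {
                        finishPart(writer, out, events, root);
                        writer = null;
                        inPart = 0;
                    }
                }
            }
            if (writer != null) {
                finishPart(writer, out, events, root); // last, possibly shorter, part
            }
            reader.close();
        }
        return parts;
    }

    // Closes the wrapper element and the underlying file of one part.
    private static void finishPart(XMLEventWriter writer, OutputStream out,
                                   XMLEventFactory events, StartElement root)
            throws IOException, XMLStreamException {
        writer.add(events.createEndElement(root.getName().getPrefix(),
                root.getName().getNamespaceURI(), root.getName().getLocalPart()));
        writer.add(events.createEndDocument());
        writer.close();
        out.close();
    }
}
```

A call such as StaxSplitter.split(sourcePath, outputDir, "employees", 10000) would cover the "10k per file" case from the question, with the last argument taken from user input. Only one record's events are held at a time, so memory use should stay roughly flat regardless of file size.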

Answer 2

A simple solution is in order. If the XML always has line breaks as shown, no XML processing is needed.

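import java.io.BufferedReader;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Fragment: place inside a method that declares "throws IOException".
Path originalPath = Paths.get("... .xml");
try (BufferedReader in = Files.newBufferedReader(originalPath, StandardCharsets.UTF_8)) {
    String line = in.readLine(); // Skip header line(s)

    line = in.readLine(); // First line of the first <Employee> record
    // One iteration per output file; stop at end of input or at the closing root tag.
    for (int fileno = 1; line != null && !line.contains("</EmployeeDetails>"); ++fileno) {
        Path partPath = Paths.get("...-" + fileno + ".xml");
        try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(partPath,
                StandardCharsets.UTF_8))) {
            int counter = 0; // Complete <Employee> records written to this part
            out.println("<EmployeeDetails>"); // Write header.
            do {
                out.println(line);
                if (line.contains("</Employee>")) {
                    ++counter;
                }
                line = in.readLine();
            } while (line != null && !line.contains("</EmployeeDetails>")
                    && counter < 1000);
            out.println("</EmployeeDetails>"); // Write footer.
        }
    }
}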
