Question

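I want to split a large file containing strings into a set of new (smaller) files, and I tried to do it with NIO.2.

I do not want to load the whole file into memory, so I tried it with BufferedReader.

The smaller text files should be limited by the number of text lines.

The solution works, but I would like to ask whether someone knows a solution with better performance using Java 8 (maybe lambdas with the stream() API?) and NIO.2: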

public void splitTextFiles(Path bigFile, int maxRows) throws IOException{

        int i = 1;
        try(BufferedReader reader = Files.newBufferedReader(bigFile)){
            String line = null;
            int lineNum = 1;

            Path splitFile = Paths.get(i + "split.txt");
            BufferedWriter writer = Files.newBufferedWriter(splitFile, StandardOpenOption.CREATE);

            while ((line = reader.readLine()) != null) {

                if(lineNum > maxRows){
                    writer.close();
                    lineNum = 1;
                    i++;
                    splitFile = Paths.get(i + "split.txt");
                    writer = Files.newBufferedWriter(splitFile, StandardOpenOption.CREATE);
                }

                writer.append(line);
                writer.newLine();
                lineNum++;
            }

            writer.close();
        }
}

Answer 1

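Beware of the difference between using InputStreamReader / OutputStreamWriter (and their subclasses) directly and using the Reader / Writer factory methods of Files. In the former case the system's default encoding is used when no explicit charset is given, whereas the latter always default to UTF-8. So I strongly recommend always specifying the desired charset, even if it is Charset.defaultCharset() or StandardCharsets.UTF_8, to document your intention and to avoid surprises if you switch between the various ways of creating a Reader or Writer.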
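To illustrate why the charset is worth spelling out, here is a minimal sketch (not part of the original answer). The class name CharsetDemo and the helper roundTrip are made up for this example; the calls themselves are the standard Files factory methods. It writes a line in one charset and reads it back in another, so a mismatch shows up as a decoding error instead of going unnoticed.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class CharsetDemo {
    /** Hypothetical helper: writes one line using writeCs, reads it back using readCs. */
    static String roundTrip(Path file, String text, Charset writeCs, Charset readCs) throws IOException {
        try (BufferedWriter w = Files.newBufferedWriter(file, writeCs)) {
            w.write(text);
            w.newLine();
        }
        try (BufferedReader r = Files.newBufferedReader(file, readCs)) {
            return r.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("charset-demo", ".txt");
        try {
            // Same charset on both sides: the text survives the round trip.
            System.out.println(roundTrip(file, "caf\u00e9",
                    StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1));
            // Written as ISO-8859-1 but read as UTF-8: the byte 0xE9 is not valid UTF-8 here,
            // and the reader created by Files reports that instead of substituting characters.
            try {
                roundTrip(file, "caf\u00e9", StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
            } catch (MalformedInputException e) {
                System.out.println("decoding failed: " + e);
            }
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Passing the charset explicitly on both sides makes this kind of mismatch visible in the code rather than dependent on which factory method happens to be used.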

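If you want to split at line boundaries, there is no way around looking into the file's contents. So you cannot optimize it the way you can when merging files.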

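If you are willing to sacrifice portability, you could try some optimizations. If you know that the charset encoding unambiguously maps '\n' to (byte)'\n', as is the case for most single-byte encodings as well as for UTF-8, you can scan for line breaks at the byte level to get the file positions for the split and avoid any data transfer from your application to the I/O system.

public void splitTextFiles(Path bigFile, int maxRows) throws IOException {
    // READ, CREATE_NEW and WRITE are static imports from java.nio.file.StandardOpenOption
    MappedByteBuffer bb;
    try(FileChannel in = FileChannel.open(bigFile, READ)) {
        bb=in.map(FileChannel.MapMode.READ_ONLY, 0, in.size());
    }
    for(int start=0, pos=0, end=bb.remaining(), i=1, lineNum=1; pos<end; lineNum++) {
        // advance pos to just past the next '\n' (or to the end of the file)
        while(pos<end && bb.get(pos++)!='\n');
        if(lineNum < maxRows && pos<end) continue;
        Path splitFile = Paths.get(i++ + "split.txt");
        // if you want to overwrite existing files use CREATE, TRUNCATE_EXISTING
        try(FileChannel out = FileChannel.open(splitFile, CREATE_NEW, WRITE)) {
            bb.position(start).limit(pos);
            while(bb.hasRemaining()) out.write(bb);
            bb.clear();
            start=pos;
            lineNum = 0;
        }
    }
}

The drawbacks are that it does not work with encodings like UTF-16 or EBCDIC and that, unlike BufferedReader.readLine(), it does not support a lone '\r' as line terminator, as used in old MacOS9.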

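Further, it only supports files smaller than 2GB; the limit is likely even smaller on 32-bit JVMs due to the limited virtual address space. For files larger than the limit, it would be necessary to iterate over chunks of the source file and map them one after another.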

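These issues could be fixed, but that would raise the complexity of this approach. Given that the speed improvement is only about 15% on my machine (I did not expect much more, as the I/O dominates here) and would be even smaller once the complexity rises, I don't think it's worth it.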
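For readers curious what lifting the 2GB limit might involve, here is a rough sketch (not the answerer's code) that maps the source file one window at a time and copies each range with FileChannel.transferTo. The class name ChunkedSplitter, the targetDir parameter and the windowSize parameter are assumptions made for this illustration. Like the byte-level approach it builds on, it assumes an encoding in which '\n' is always the single byte 0x0A.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

class ChunkedSplitter {
    /** Scans the file window by window and returns the end offset (exclusive) of every split file. */
    static List<Long> findSplitOffsets(FileChannel in, int maxRows, int windowSize) throws IOException {
        List<Long> offsets = new ArrayList<>();
        long size = in.size();
        int lineNum = 0; // complete lines seen since the last split point
        for (long windowStart = 0; windowStart < size; windowStart += windowSize) {
            int len = (int) Math.min(windowSize, size - windowStart);
            MappedByteBuffer bb = in.map(FileChannel.MapMode.READ_ONLY, windowStart, len);
            for (int pos = 0; pos < len; pos++) {
                if (bb.get(pos) == '\n' && ++lineNum == maxRows) {
                    offsets.add(windowStart + pos + 1); // split right after the line break
                    lineNum = 0;
                }
            }
        }
        // The remainder after the last full group of lines (it may lack a trailing '\n').
        if (size > 0 && (offsets.isEmpty() || offsets.get(offsets.size() - 1) < size)) {
            offsets.add(size);
        }
        return offsets;
    }

    /** Writes 1split.txt, 2split.txt, ... into targetDir and returns the number of files created. */
    static int split(Path bigFile, Path targetDir, int maxRows, int windowSize) throws IOException {
        if (maxRows < 1 || windowSize < 1) {
            throw new IllegalArgumentException("maxRows and windowSize must be positive");
        }
        try (FileChannel in = FileChannel.open(bigFile, StandardOpenOption.READ)) {
            long start = 0;
            int i = 0;
            for (long end : findSplitOffsets(in, maxRows, windowSize)) {
                Path splitFile = targetDir.resolve(++i + "split.txt");
                try (FileChannel out = FileChannel.open(splitFile,
                        StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE)) {
                    long pos = start;
                    while (pos < end) { // transferTo may copy fewer bytes than requested
                        pos += in.transferTo(pos, end - pos, out);
                    }
                }
                start = end;
            }
            return i;
        }
    }
}
```

The extra bookkeeping (line counts carried across windows, a separate list of offsets) gives a sense of the added complexity the answer refers to; whether it pays off would have to be measured.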

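The bottom line is that for this task the Reader / Writer approach is sufficient, but you should take care about the Charset used for the operation.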
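As a sketch of that bottom line (again not the answerer's code), the Reader / Writer variant can take the charset as an explicit parameter and let try-with-resources close each writer. The class name LineSplitter and the targetDir parameter are assumptions for this example; the output file names follow the pattern used in the question.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

class LineSplitter {
    /** Splits bigFile into files of at most maxRows lines each; returns the number of files written. */
    static int split(Path bigFile, Path targetDir, int maxRows, Charset cs) throws IOException {
        if (maxRows < 1) {
            throw new IllegalArgumentException("maxRows must be at least 1");
        }
        int fileCount = 0;
        try (BufferedReader reader = Files.newBufferedReader(bigFile, cs)) {
            String line = reader.readLine();
            while (line != null) {
                // A new output file is opened only while there is still a line to write.
                Path splitFile = targetDir.resolve(++fileCount + "split.txt");
                try (BufferedWriter writer = Files.newBufferedWriter(splitFile, cs)) {
                    for (int n = 0; n < maxRows && line != null; n++) {
                        writer.write(line);
                        writer.newLine(); // platform line separator, as in the question's code
                        line = reader.readLine();
                    }
                }
            }
        }
        return fileCount;
    }
}
```

A call such as LineSplitter.split(Paths.get("big.txt"), Paths.get("."), 1000, StandardCharsets.UTF_8) would then state the encoding at the call site; the file name and row count here are placeholders.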

Answer 2

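I made a few modifications to @nimo23's code, adding the option of a header and a footer for each split file. It also writes the files into a directory that has the same name as the original file with _split appended to it. The code follows: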

// needs: import java.io.*; import java.nio.file.*;
public static void splitTextFiles(String fileName, int maxRows, String header, String footer) throws IOException
    {
        // absolute form, so that getParentFile() is not null for a bare file name
        File bigFile = new File(fileName).getAbsoluteFile();
        int i = 0; // number of split files written so far
        String name = bigFile.getName();
        int dot = name.lastIndexOf('.');
        String ext = dot < 0 ? "" : name.substring(dot); // tolerate names without an extension
        String fileNoExt = dot < 0 ? name : name.substring(0, dot);

        // File(parent, child) uses the platform's separator instead of a hard-coded "\\"
        File newDir = new File(bigFile.getParentFile(), fileNoExt + "_split");
        newDir.mkdirs();
        BufferedWriter writer = null;
        try (BufferedReader reader = Files.newBufferedReader(bigFile.toPath()))
        {
            String line;
            int lineNum = 1;
            while ((line = reader.readLine()) != null)
            {
                if (lineNum == 1)
                {
                    // open the next split file only when there is a line to put in it,
                    // so no trailing file containing just the footer is created
                    i++;
                    Path splitFile = newDir.toPath().resolve(fileNoExt + "_" + String.format("%03d", i) + ext);
                    // without explicit options this is CREATE, TRUNCATE_EXISTING, WRITE,
                    // so leftovers of an older, longer file are not kept
                    writer = Files.newBufferedWriter(splitFile);
                    writer.append(header);
                    writer.newLine();
                }
                writer.append(line);
                writer.newLine();
                lineNum++;
                if (lineNum > maxRows)
                {
                    writer.append(footer);
                    writer.close();
                    writer = null;
                    lineNum = 1;
                }
            }
            if (writer != null) // the last split file is only partially filled
            {
                writer.append(footer);
            }
        }
        finally
        {
            if (writer != null) // also reached when reading or writing failed
            {
                writer.close();
            }
        }

        System.out.println("file '" + bigFile.getName() + "' split into " + i + " files");
    }

Split a very large text file by maximum number of rows
