Need high-performance text file reading and parsing (split()-like)

Time: 2011-05-03 07:59:20

Tags: java

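Currently I have:

  • 1 file with 9 million lines
  • BufferedReader.readLine() to read each line
  • String.split() to parse each line (pipe-delimited columns)
  • Heavy RAM usage (because of String interning?)

The problem: as you may have guessed, I would like to read and parse this file more efficiently...

Questions:

  • How can I read this relatively large file using the fewest resources possible (knowing that each line needs some kind of "split" on the pipe character)?
  • Can I replace String.split with something else (say, StringBuilder, CharBuffer, ...)?
  • What is the best way to avoid creating Strings while reading the file, until it has been split into its final character sequences?
  • I don't mind using something other than String in my POJOs, if you have anything better to suggest.
  • The file is reloaded every few hours, in case that helps you suggest a solution.

Thanks :)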

1 answer:

Answer 0: (score: 2)

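A 9-million-line file should take no more than a few seconds. Most of that time is spent reading the data into memory. How you split the data is unlikely to make much difference.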

BufferedReader and String.split sound fine to me. I wouldn't use interning unless you are sure it will help. (split won't call intern() for you.)
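To illustrate the interning remark, here is a minimal sketch. The class name InternSketch, the helper firstField and the sample lines are made up for illustration and are not from the question. It shows that fields with equal content only share one canonical String instance once you call intern() on them explicitly.

```java
public class InternSketch {
    // Hypothetical helper: returns the first pipe-delimited field of a line.
    static String firstField(String line) {
        return line.split("\\|")[0];
    }

    public static void main(String[] args) {
        String a = firstField("GB|London");
        String b = firstField("GB|Paris");

        // Same content in both fields.
        System.out.println(a.equals(b));              // prints "true"

        // After explicit interning, both refer to the same canonical instance.
        System.out.println(a.intern() == b.intern()); // prints "true"
    }
}
```

Whether interning actually saves memory depends on how often column values repeat, so it is worth measuring before adopting it.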

Recent versions of Java 6 include some performance improvements in String handling. I would try Java 6 update 25 to see whether it is any faster.


EDIT: Some testing shows that split is surprisingly slow, and that you can improve on it.

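// Imports needed by the benchmark below (the method itself goes inside any class).
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public static void main(String... args) throws IOException {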
    long start1 = System.nanoTime();
    PrintWriter pw = new PrintWriter("deleteme.txt");
    StringBuilder sb = new StringBuilder();
    for (int j = 1000; j < 1040; j++)
        sb.append(j).append(' ');
    String outLine = sb.toString();
    for (int i = 0; i < 1000 * 1000; i++)
        pw.println(outLine);
    pw.close();
    long time1 = System.nanoTime() - start1;
    System.out.printf("Took %f seconds to write%n", time1 / 1e9);

    {
        long start = System.nanoTime();
        FileReader fr = new FileReader("deleteme.txt");
        char[] buffer = new char[1024 * 1024];
        while (fr.read(buffer) > 0) ;
        fr.close();
        long time = System.nanoTime() - start;
        System.out.printf("Took %f seconds to read text as fast as possible%n", time / 1e9);
    }
    {
        long start = System.nanoTime();
        BufferedReader br = new BufferedReader(new FileReader("deleteme.txt"));
        String line;
        while ((line = br.readLine()) != null) {
            String[] words = line.split(" ");
        }
        br.close();
        long time = System.nanoTime() - start;
        System.out.printf("Took %f seconds to read lines and split%n", time / 1e9);
    }
    {
        long start = System.nanoTime();
        BufferedReader br = new BufferedReader(new FileReader("deleteme.txt"));
        String line;
        Pattern splitSpace = Pattern.compile(" ");
        while ((line = br.readLine()) != null) {
            String[] words = splitSpace.split(line, 0);
        }
        br.close();
        long time = System.nanoTime() - start;
        System.out.printf("Took %f seconds to read lines and split (precompiled)%n", time / 1e9);
    }
    {
        long start = System.nanoTime();
        BufferedReader br = new BufferedReader(new FileReader("deleteme.txt"));
        String line;
        List<String> words = new ArrayList<String>();
        while ((line = br.readLine()) != null) {
            words.clear();
            int pos = 0, end;
            while ((end = line.indexOf(' ', pos)) >= 0) {
                words.add(line.substring(pos, end));
                pos = end + 1;
            }
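            // Note: any text after the last ' ' is not added. Every line written
            // above ends with a space, so all 40 words are collected here.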
            //System.out.println(words);
        }
        br.close();
        long time = System.nanoTime() - start;
        System.out.printf("Took %f seconds to read lines and break using indexOf%n", time / 1e9);
    }
}

prints

Took 1.757984 seconds to write
Took 1.158652 seconds to read text as fast as possible
Took 6.671587 seconds to read lines and split
Took 4.210100 seconds to read lines and split (precompiled)
Took 1.642296 seconds to read lines and break using indexOf

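So it appears that splitting the string yourself is an improvement, and it gets you close to the best you are going to get while reading text. The only way to read it faster is to treat the file as binary/ASCII-7. ;)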
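Applied to the pipe-delimited lines from the question, the indexOf approach could be sketched as follows. The class name PipeSplitter, the method splitOnPipe and the sample line are hypothetical and are not part of the benchmark above. Unlike the benchmark loop, this version also keeps the field after the last delimiter.

```java
import java.util.ArrayList;
import java.util.List;

public class PipeSplitter {
    // Hypothetical helper: splits a line on '|' using indexOf/substring,
    // reusing the caller's list to avoid allocating a new one per line.
    static List<String> splitOnPipe(String line, List<String> out) {
        out.clear();
        int pos = 0, end;
        while ((end = line.indexOf('|', pos)) >= 0) {
            out.add(line.substring(pos, end));
            pos = end + 1;
        }
        // Add the last field, which has no trailing delimiter (may be empty).
        out.add(line.substring(pos));
        return out;
    }

    public static void main(String[] args) {
        List<String> fields = new ArrayList<String>();
        System.out.println(splitOnPipe("id|name||city", fields)); // prints "[id, name, , city]"
    }
}
```

Note that this keeps trailing empty fields, whereas String.split with a limit of 0 drops them, so the two are not exact drop-in replacements for each other.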