Java reading a large file: Java heap space

Date: 2017-04-06 16:20:48

Tags: java heap space

I wrote this code:

try (BufferedReader file = new BufferedReader(new FileReader("C:\\Users\\User\\Desktop\\big50m.txt"))) {
    String line;
    StringTokenizer st;

    while ((line = file.readLine()) != null) {
        st = new StringTokenizer(line); // Separation of integers of the file line
        while (st.hasMoreTokens())
            numbers.add(Integer.parseInt(st.nextToken())); // Converting and adding to the list of numbers
    }

} catch (Exception e) {
    System.out.println("Can't read the file...");
}

The big50m file contains 50,000,000 integers, and I get this runtime error:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3332)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:596)
    at java.lang.StringBuffer.append(StringBuffer.java:367)
    at java.io.BufferedReader.readLine(BufferedReader.java:370)
    at java.io.BufferedReader.readLine(BufferedReader.java:389)
    at unsortedfilesapp.UnsortedFilesApp.main(UnsortedFilesApp.java:37)
C:\Users\User\AppData\Local\NetBeans\Cache\8.2\executor-snippets\run.xml:53: Java returned: 1
BUILD FAILED (total time: 5 seconds)

I think the problem is the String variable named line. Can you tell me how to solve this problem? I used a StringTokenizer because I want fast reading.

5 Answers:

Answer 0 (score: 1)

Create a BufferedReader from the file and read() char by char. Collect digit chars, convert them with Integer.parseInt(), skip any non-digit char, and continue parsing at the next digit, and so on.
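A rough sketch of that approach (untested, my own illustration: it assumes the file path from the question, non-negative integers, and accumulates digits directly into an int instead of building a String first; the process() helper is hypothetical):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class CharByCharParser {
    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new FileReader("C:\\Users\\User\\Desktop\\big50m.txt"))) {
            int value = 0;           // number currently being built
            boolean inNumber = false;
            int c;
            while ((c = reader.read()) != -1) {
                if (c >= '0' && c <= '9') {
                    value = value * 10 + (c - '0'); // append digit
                    inNumber = true;
                } else if (inNumber) {
                    // a non-digit ends the current number
                    process(value);
                    value = 0;
                    inNumber = false;
                }
            }
            if (inNumber) process(value); // last number if the file does not end with a separator
        }
    }

    private static void process(int n) {
        // placeholder: handle each parsed integer without storing them all
    }
}

Because each value is handled as soon as it is complete, nothing has to be kept in memory.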

Answer 1 (score: 0)

Here is a version that minimizes memory usage: no byte-to-char conversion and no String operations. It does not handle negative numbers, though.

    public static void main(final String[] a) {
        final Set<Integer> number = new HashSet<>();
        int v = 0;
        boolean use = false;
        int c;
        // InputStream avoids byte-to-char conversion
        try (InputStream s = new FileInputStream("C:\\Users\\User\\Desktop\\big50m.txt")) {
            // No allocation inside the loop
            do {
                if ((c = s.read()) == -1) break;
                if (c >= '0' && c <= '9') { v = v * 10 + c - '0'; use = true; continue; }
                if (use) number.add(v);
                use = false;
                v = 0;
            } while (true);
            if (use) number.add(v);
        } catch (final Exception e) { System.out.println("Can't read the file..."); }
    }

Answer 2 (score: 0)

The readLine() method reads an entire line at once, which uses a lot of memory. That is inefficient and does not scale to arbitrarily large files.

You can use a StreamTokenizer instead, like this:

StreamTokenizer tokenizer = new StreamTokenizer(new FileReader("bigfile.txt"));
tokenizer.parseNumbers(); // default behaviour
while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
    if (tokenizer.ttype == StreamTokenizer.TT_NUMBER) {
        numbers.add((int)Math.round(tokenizer.nval));
    }
}

I have not tested this code, but it should give you the general idea.

Answer 3 (score: 0)

The posted snippet works when the program is run with -Xmx2048m (with one tweak: numbers declared as List<Integer> numbers = new ArrayList<>(50000000);).
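For example, assuming the main class name shown in the stack trace, the larger heap can be requested on the command line (in NetBeans the same flag goes into the project's VM options):

java -Xmx2048m unsortedfilesapp.UnsortedFilesApp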

Answer 4 (score: 0)

Since all the numbers are on a single line, the BufferedReader approach will not work and will not scale: the whole line, and therefore the whole file, ends up in memory. So a streaming approach (e.g. the one from @whbogado) is indeed the way to go.

StreamTokenizer tokenizer = new StreamTokenizer(new FileReader("bigfile.txt"));
tokenizer.parseNumbers(); // default behaviour
while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
    if (tokenizer.ttype == StreamTokenizer.TT_NUMBER) {
        numbers.add((int)Math.round(tokenizer.nval));
    }
}

As you write, you still get a heap space error, so I assume the streaming is no longer the problem. Unfortunately you store all values in a List, and I think that is the problem now. You said in a comment that you do not know the actual number of values, so you should avoid storing them in a List and do some kind of streaming there as well (see the sketch below).
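A minimal sketch of that idea (my own illustration, not from the answer, reusing the bigfile.txt name from above): compute whatever aggregate you actually need, here just a count and a sum, while tokenizing, so no List is ever built.

import java.io.FileReader;
import java.io.IOException;
import java.io.StreamTokenizer;

public class StreamingAggregate {
    public static void main(String[] args) throws IOException {
        long count = 0;
        long sum = 0;
        try (FileReader reader = new FileReader("bigfile.txt")) {
            StreamTokenizer tokenizer = new StreamTokenizer(reader);
            while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
                if (tokenizer.ttype == StreamTokenizer.TT_NUMBER) {
                    // consume each value immediately instead of adding it to a List
                    count++;
                    sum += (long) tokenizer.nval;
                }
            }
        }
        System.out.println("count=" + count + " sum=" + sum);
    }
}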

For anyone interested, here is my small test code (Java 8). It generates a test file of the desired size USED_INT_VALUES; I have limited it to 5,000,000 ints for now. As you can see, memory usage rises steadily while the file is being read. The only place that holds that much memory is the numbers List.

Note that initializing the ArrayList with an initial capacity does not allocate the memory needed for the stored objects, in your case the Integers; it only sizes the backing array of references.

import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.StreamTokenizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.logging.Level;
import java.util.logging.Logger;

public class TestBigFiles {

    public static void main(String args[]) throws IOException {
        heapStatistics("program start");
        final int USED_INT_VALUES = 5000000;
        File tempFile = File.createTempFile("testdata_big_50m", ".txt");
        System.out.println("using file " + tempFile.getAbsolutePath());
        tempFile.deleteOnExit();

        Random rand = new Random();
        FileWriter writer = new FileWriter(tempFile);
        rand.ints(USED_INT_VALUES).forEach(i -> {
            try {
                writer.write(i + " ");
            } catch (IOException ex) {
                Logger.getLogger(TestBigFiles.class.getName()).log(Level.SEVERE, null, ex);
            }
        });
        writer.close();
        heapStatistics("large file generated - size=" + tempFile.length() + "Bytes");
        List<Integer> numbers = new ArrayList<>(USED_INT_VALUES);

        heapStatistics("large array allocated (to avoid array copy)");

        int c = 0;
        try (FileReader fileReader = new FileReader(tempFile);) {
            StreamTokenizer tokenizer = new StreamTokenizer(fileReader);

            while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
                if (tokenizer.ttype == StreamTokenizer.TT_NUMBER) {
                    numbers.add((int) tokenizer.nval);
                    c++;
                }
                if (c % 100000 == 0) {
                    heapStatistics("within loop count " + c);
                }
            }
        }

        heapStatistics("large file parsed nummer list size is " + numbers.size());
    }

    private static void heapStatistics(String message) {
        int MEGABYTE = 1024 * 1024;
        //clean up unused stuff
        System.gc();
        Runtime runtime = Runtime.getRuntime();
        System.out.println("##### " + message + " #####");

        System.out.println("Used Memory:" + (runtime.totalMemory() - runtime.freeMemory()) / MEGABYTE + "MB"
                + " Free Memory:" + runtime.freeMemory() / MEGABYTE + "MB"
                + " Total Memory:" + runtime.totalMemory() / MEGABYTE + "MB"
                + " Max Memory:" + runtime.maxMemory() / MEGABYTE + "MB");
    }
}