Question

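I have an ASCII file with 180 columns of numbers and about 60,000 rows. The file size is about 80MB.

I need to read this file into a 2D array of size 180x60000.

Example of the file structure:

gsrv01: 946177 946061 .. [many columns] .. 8359486 8359485 0 end total 184

.. [many rows] ..

gsrv01: 945998 946259 .. [many columns] .. 8359489 8359487 1 end total 184

While this file is being read, my memory usage is about 800MB. I use the data from this file in a GUI application, so total memory usage exceeds 1200MB. This is unacceptable.

Am I reading the file the wrong way? How can I reduce the memory usage?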

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReadBigData {

public static void main(String[] args){

    String pathFilename = "E:\\data\\8.txt";

    long startTime = System.nanoTime();
    new ReadBigData(pathFilename);
    long endTime = System.nanoTime();

    long duration = (endTime - startTime);  //divide by 1000000 to get milliseconds.
    double dur = (double) duration/1000000/1000;
    System.out.println("Elapsed: " + dur + " sec.");

    try {
        System.in.read(); //to wait after execution.
    } catch (IOException e) {
        e.printStackTrace();
    }


}

public ReadBigData(String pathFilename){

    //list for containing data
    List<List<Double>> dataTableList = new ArrayList<List<Double>>();

    Pattern spacePattern = Pattern.compile("\\s+"); //split by whitespace or tab

    String regex = "^gsrv01:\\s+(.*)\\s+(\\d+)\\s+end total.*";//. -- any symbol, * -- repeated zero or more times.
    Pattern pattern = Pattern.compile(regex);

    try {
        FileInputStream inputStream = new FileInputStream(pathFilename);
        BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(inputStream));
        String line = null;
        while ((line = bufferedReader.readLine()) != null) {

            Matcher matches = pattern.matcher(line);
            while(matches.find()){

                //slow!!!!!!!!!!!!

                String columnsStr =  matches.group(1);
                List<String> columnsList = Arrays.asList(spacePattern.split(columnsStr, 0)); //fast

                List<Double> list = new ArrayList<Double>();
                for (String str : columnsList) {
                    list.add(Double.parseDouble(str));
                }
                dataTableList.add(list);
            }
        }
        inputStream.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
    //list to array
    Double[][] dataTable = new Double[dataTableList.size()][];
    for (int i = 0; i < dataTableList.size(); i++) {
        List<Double> row = dataTableList.get(i);
        dataTable[i] = row.toArray(new Double[row.size()]);
    }

}
}

File link[80MB]

Answer 1

There is an API for processing unquantifiable sets of data: streams. Depending on the amount of numbers, you may want to drop the nested stream and just use a for loop.

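import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class ReadBigData {

    public static List<double[]> read(String pathFilename) {

        // Group 1 captures the data columns between the "gsrv01:" prefix
        // and the trailing "<n> end total ..." suffix.
        Pattern pattern = Pattern.compile("^gsrv01:\\s+(.*)\\s+(\\d+)\\s+end total.*");

        try (FileInputStream in = new FileInputStream(pathFilename);
             InputStreamReader stream = new InputStreamReader(in);
             BufferedReader reader = new BufferedReader(stream)) {

            return reader.lines()
                    .map(pattern::matcher)
                    .filter(Matcher::matches)          // skip lines that do not fit the format
                    .map(matcher -> matcher.group(1))
                    .map(s -> s.split("\\s+"))
                    .map(strings -> Arrays.stream(strings)
                            .mapToDouble(Double::parseDouble)
                            .toArray())                // one primitive double[] per row
                    .collect(Collectors.toList());

        } catch (IOException e) {
            return Collections.emptyList();
        }
    }

    public static void main(String[] args) {
        System.out.println(read("8.txt").size());
    }
}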

This approach parsed 59292 rows of numbers from the 80MB file you attached in under 3 seconds on my 6-year-old laptop.
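
If the nested stream turns out to be too costly, the inner `Arrays.stream(...)` step can be swapped for a plain loop, as suggested at the top of this answer. A minimal sketch, assuming a hypothetical helper named `parseRow` in a made-up class `RowParser` (neither is part of the original answer) that receives the captured column text:

```java
import java.util.Arrays;

class RowParser {

    // Hypothetical helper: converts one whitespace-separated row of numbers
    // into a primitive double[] using a plain for loop instead of a nested stream.
    static double[] parseRow(String columns) {
        String[] parts = columns.trim().split("\\s+");
        double[] row = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            row[i] = Double.parseDouble(parts[i]);
        }
        return row;
    }

    public static void main(String[] args) {
        // Three sample values taken from the example line in the question.
        System.out.println(Arrays.toString(parseRow("946177 946061 8359486")));
    }
}
```

In the pipeline, the two inner `map` steps after `matcher.group(1)` would then collapse into a single `.map(RowParser::parseRow)`.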

Answer 2

Try removing the List<List<>> and the unnecessary regex, and use double instead of Double, like this:

public double[][] readBigData(String pathFilename)
{
   // list for containing data
   final List<double[]> dataTableList = new ArrayList<>();
   final Pattern spacePattern = Pattern.compile("\\s+"); //split by whitespace or tab
   try (final BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(
                                              new FileInputStream(pathFilename))))
   {
      String line;
      // Read the next line on every iteration; stop at end of file
      while ((line = bufferedReader.readLine()) != null)
      {
         final String[] fields = spacePattern.split(line, 0);
         final int l = fields.length;
         // Check the format; the length guard avoids bad indexes on short lines
         if (l >= 5 && "gsrv01:".equals(fields[0]) && "end".equals(fields[l-3]) &&
             "total".equals(fields[l-2]))
         {
            final double[] list = new double[l-5];
            for (int i = 1; i < l-4; ++i)
            {
               list[i-1] = Double.parseDouble(fields[i]);
            }
            dataTableList.add(list);
         }
      }
   }
   catch (IOException e)
   {
      e.printStackTrace();
   }

   // list to array
   return dataTableList.toArray(new double[dataTableList.size()][]);
}

Also, you should not process data in a constructor...
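
To give a feel for why switching from Double to double matters here, the following back-of-the-envelope sketch compares the two layouts for the 180x60000 table from the question. The per-object and per-reference sizes (24 and 4 bytes) are assumptions that are typical for a 64-bit HotSpot JVM with compressed pointers, not guaranteed values, and the class name `MemoryEstimate` is made up for illustration:

```java
class MemoryEstimate {

    // Bytes needed to hold rows x cols values as primitive doubles (8 bytes each),
    // ignoring the small per-array headers.
    static long primitiveBytes(long rows, long cols) {
        return rows * cols * 8L;
    }

    // Bytes needed when every value is a separate boxed object reached through a reference.
    // bytesPerObject and bytesPerReference are JVM-dependent assumptions.
    static long boxedBytes(long rows, long cols, long bytesPerObject, long bytesPerReference) {
        return rows * cols * (bytesPerObject + bytesPerReference);
    }

    public static void main(String[] args) {
        long rows = 60_000;
        long cols = 180;
        System.out.println("double[][]: ~" + primitiveBytes(rows, cols) / 1_000_000 + " MB");
        System.out.println("Double[][]: ~" + boxedBytes(rows, cols, 24, 4) / 1_000_000 + " MB");
    }
}
```

Under these assumptions the primitive table needs roughly 86 MB and the boxed one roughly 302 MB, before counting the temporary strings and lists created while parsing.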

Answer 3

In addition to what is listed above, I have another small observation:

Replace the following declaration

List<Double> list = new ArrayList<Double>();

with

List<Double> list = new ArrayList<Double>(columnsList.size());

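This way, you prevent the list's backing array from being reallocated as it grows.

Use an array instead of a List. This avoids copying data from an array to a list and back: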
```
Double[] coll = new Double[columnsList.size()];
```
instead of
```
List<Double> list = new ArrayList<Double>();
```
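
Put in context, the two suggestions from this answer might look like the sketch below. `columnsList` is the variable name from the question; the class name `PreSizedRows` and the method names are made up for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PreSizedRows {

    // Variant 1: keep the List, but give it its final capacity up front
    // so the backing array never has to grow while values are added.
    static List<Double> toList(List<String> columnsList) {
        List<Double> list = new ArrayList<Double>(columnsList.size());
        for (String str : columnsList) {
            list.add(Double.parseDouble(str));
        }
        return list;
    }

    // Variant 2: fill an array directly and skip the intermediate List.
    static Double[] toArray(List<String> columnsList) {
        Double[] coll = new Double[columnsList.size()];
        for (int i = 0; i < coll.length; i++) {
            coll[i] = Double.parseDouble(columnsList.get(i));
        }
        return coll;
    }

    public static void main(String[] args) {
        List<String> columnsList = Arrays.asList("946177", "946061", "8359486");
        System.out.println(toList(columnsList));
        System.out.println(Arrays.toString(toArray(columnsList)));
    }
}
```

Both variants still box every value; they only remove the resizing and the extra list-to-array copy.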

Heap size is very large when reading a big file
