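I have an ASCII file with 180 columns of numbers and about 60000 rows. The file size is about 80 MB.
I need to read this file into a two-dimensional array of size 180x60000.
Example of the file structure: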
gsrv01: 946177 946061 .. [many columns] .. 8359486 8359485 0 end total 184
.. [many rows] ..
gsrv01: 945998 946259 .. [many columns] .. 8359489 8359487 1 end total 184
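While this file is being read, my memory usage is about 800 MB. I use the data from this file in a GUI application, so the total memory usage exceeds 1200 MB. This is unacceptable.
Am I reading the file the right way? How can I reduce the memory usage?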
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReadBigData {
    public static void main(String[] args){
        String pathFilename = "E:\\data\\8.txt";
        long startTime = System.nanoTime();
        new ReadBigData(pathFilename);
        long endTime = System.nanoTime();
        long duration = (endTime - startTime); //divide by 1000000 to get milliseconds.
        double dur = (double) duration/1000000/1000;
        System.out.println("Elapsed: " + dur + " sec.");
        try {
            System.in.read(); //to wait after execution.
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public ReadBigData(String pathFilename){
        //list for containing data
        List<List<Double>> dataTableList = new ArrayList<List<Double>>();
        Pattern spacePattern = Pattern.compile("\\s+"); //split by whitespace or tab
        String regex = "^gsrv01:\\s+(.*)\\s+(\\d+)\\s+end total.*";//. -- any symbol, * -- repeated zero or more times.
        Pattern pattern = Pattern.compile(regex);
        try {
            FileInputStream inputStream = new FileInputStream(pathFilename);
            BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(inputStream));
            String line = null;
            while ((line = bufferedReader.readLine()) != null) {
                Matcher matches = pattern.matcher(line);
                while(matches.find()){
                    //slow!!!!!!!!!!!!
                    String columnsStr = matches.group(1);
                    List<String> columnsList = Arrays.asList(spacePattern.split(columnsStr, 0)); //fast
                    List<Double> list = new ArrayList<Double>();
                    for (String str : columnsList) {
                        list.add(Double.parseDouble(str));
                    }
                    dataTableList.add(list);
                }
            }
            inputStream.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
        //list to array
        Double[][] dataTable = new Double[dataTableList.size()][];
        for (int i = 0; i < dataTableList.size(); i++) {
            List<Double> row = dataTableList.get(i);
            dataTable[i] = row.toArray(new Double[row.size()]);
        }
    }
}
Answer 0 (score: 1)
Java has an API for processing unquantifiable sets of data: streams. Depending on the amount of numbers, you may want to remove the nested stream and just use a for loop.
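import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class ReadBigData {
    public static List<double[]> read(String pathFilename) {
        Pattern pattern = Pattern.compile("^gsrv01:\\s+(.*)\\s+(\\d+)\\s+end total.*");
        try (FileInputStream in = new FileInputStream(pathFilename);
             InputStreamReader stream = new InputStreamReader(in);
             BufferedReader reader = new BufferedReader(stream)) {
            return reader.lines()
                    .map(pattern::matcher)
                    .filter(Matcher::matches)           // keep only lines in the expected format
                    .map(matcher -> matcher.group(1))   // the whitespace-separated data columns
                    .map(s -> s.split("\\s+"))
                    .map(strings -> Arrays.stream(strings)
                            .mapToDouble(Double::parseDouble)
                            .toArray())                 // one primitive double[] per row
                    .collect(Collectors.toList());
        } catch (IOException e) {
            return Collections.emptyList();
        }
    }

    public static void main(String[] args) {
        System.out.println(read("8.txt").size());
    }
}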
On my 6-year-old laptop, this method parsed 59292 rows of numbers from the 80 MB file you attached in under 3 seconds.
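The "just use a for loop" variant mentioned in this answer could look roughly like the sketch below. The class name RowParser and the method parseRow are made up for illustration; only the regex is taken from the answer. It parses a single line into a primitive double[] without the nested stream.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: RowParser and parseRow are illustrative names, not part of the answer.
public class RowParser {
    // Same pattern as in the answer's read() method.
    private static final Pattern LINE =
            Pattern.compile("^gsrv01:\\s+(.*)\\s+(\\d+)\\s+end total.*");

    /** Returns the numeric columns of one line, or null if the line does not match. */
    public static double[] parseRow(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] tokens = m.group(1).split("\\s+");
        double[] row = new double[tokens.length];
        // Plain for loop instead of Arrays.stream(...).mapToDouble(...).toArray()
        for (int i = 0; i < tokens.length; i++) {
            row[i] = Double.parseDouble(tokens[i]);
        }
        return row;
    }

    public static void main(String[] args) {
        double[] row = parseRow("gsrv01: 946177 946061 8359486 0 end total 184");
        System.out.println(row.length); // prints 3
    }
}
```

In the stream pipeline, the four map/filter steps after reader.lines() could then be replaced by .map(RowParser::parseRow).filter(Objects::nonNull), assuming java.util.Objects is imported.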
Answer 1 (score: 0)
Try removing the List<List<>>, dropping the unnecessary regex, and using double instead of Double, like this:
public double[][] readBigData(String pathFilename)
{
    // list for containing data
    final List<double[]> dataTableList = new ArrayList<>();
    final Pattern spacePattern = Pattern.compile("\\s+"); //split by whitespace or tab
    try (final BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(
            new FileInputStream(pathFilename))))
    {
        String line;
        // Read the next line on every iteration until the end of the file
        while ((line = bufferedReader.readLine()) != null)
        {
            final String[] fields = spacePattern.split(line, 0);
            final int l = fields.length;
            // Check the format: "gsrv01:", the data columns, one more number, "end", "total", a count
            if (l >= 5 && "gsrv01:".equals(fields[0]) && "end".equals(fields[l-3]) &&
                    "total".equals(fields[l-2]))
            {
                final double[] list = new double[l-5];
                for (int i = 1; i < l-4; ++i)
                {
                    list[i-1] = Double.parseDouble(fields[i]);
                }
                dataTableList.add(list);
            }
        }
    }
    catch (IOException e)
    {
        e.printStackTrace();
    }
    // list to array
    return dataTableList.toArray(new double[dataTableList.size()][]);
}
Also, the data should not be processed in the constructor...
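To get a feel for why switching from Double to double matters, here is a back-of-the-envelope estimate. The per-value sizes are assumptions (16 bytes per boxed Double object plus a 4-byte reference, which is typical for a 64-bit HotSpot JVM with compressed references, versus 8 bytes per primitive double); array and list headers are ignored, and the class name MemoryEstimate is made up for illustration.

```java
// Rough arithmetic only; the sizes below are assumptions, not measurements.
public class MemoryEstimate {
    static final long BOXED_DOUBLE_BYTES = 16;    // assumed size of one Double object
    static final long REFERENCE_BYTES = 4;        // assumed size of one compressed reference
    static final long PRIMITIVE_DOUBLE_BYTES = 8; // a double is always 8 bytes

    /** Estimated bytes for rows x cols values stored as references to boxed Double objects. */
    public static long boxedBytes(long rows, long cols) {
        return rows * cols * (BOXED_DOUBLE_BYTES + REFERENCE_BYTES);
    }

    /** Estimated bytes for rows x cols values stored in primitive double arrays. */
    public static long primitiveBytes(long rows, long cols) {
        return rows * cols * PRIMITIVE_DOUBLE_BYTES;
    }

    public static void main(String[] args) {
        long rows = 60_000;
        long cols = 180;
        System.out.println("boxed:     " + boxedBytes(rows, cols) / 1_000_000 + " MB");     // 216 MB
        System.out.println("primitive: " + primitiveBytes(rows, cols) / 1_000_000 + " MB"); // 86 MB
    }
}
```

Under these assumptions the boxed representation needs roughly 2.5 times the memory of the primitive one. The code in the question also keeps the List<List<Double>> and the Double[][] alive at the same time, along with the temporary strings created during parsing, which may account for part of the remaining gap to the observed 800 MB.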
Answer 2 (score: -1)
In addition to what is listed above, I have another small observation:
Replace the following declaration
List<Double> list = new ArrayList<Double>();
with
List<Double> list = new ArrayList<Double>(columnsList.size());
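This way, you avoid the list's backing array being resized as the list grows.
Alternatively, use an array instead of a List. With this, you avoid copying data from the array to the list and vice versa:
Double[] coll = new Double[columnsList.size()];
instead of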
List<Double> list = new ArrayList<Double>();
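As a minimal sketch of the two suggestions in this answer, assuming the columns of one line are already available as a String[] (the class name PresizedRow and both method names are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper showing both variants side by side.
public class PresizedRow {
    /** Variant 1: a list created with the final capacity, so its backing array never has to grow. */
    public static List<Double> toList(String[] columns) {
        List<Double> list = new ArrayList<Double>(columns.length);
        for (String str : columns) {
            list.add(Double.parseDouble(str));
        }
        return list;
    }

    /** Variant 2: fill an array directly, so no list-to-array copy is needed afterwards. */
    public static Double[] toArray(String[] columns) {
        Double[] coll = new Double[columns.length];
        for (int i = 0; i < columns.length; i++) {
            coll[i] = Double.parseDouble(columns[i]); // the parsed double is autoboxed to Double
        }
        return coll;
    }

    public static void main(String[] args) {
        String[] columns = {"946177", "946061", "8359486"};
        System.out.println(toList(columns).size());  // prints 3
        System.out.println(toArray(columns).length); // prints 3
    }
}
```

Both variants still box every value; they only avoid the resizing and copying work, so the per-value memory cost stays the same as in the question's code.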