Question

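In Octave, I am reading very large text files from disk and parsing them. Functionally, textread() does exactly what I want. Looking at the source, however, textread.m pulls the entire text file into memory before it attempts to parse any lines. If the text file is large, it fills all of my available RAM (16 GB) with text and then starts swapping back to disk (virtual memory) before parsing. If I wait long enough, textread() will finish, but it takes almost forever.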

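Note that once parsed into a matrix of floating-point values, the same data fits into memory quite easily. So I am using textread() in an intermediate zone: there is enough memory for the floats, but not enough to hold the same data as text.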

All of this is background for my question, which is about strread(). The data in my text file looks like this:

0.0647148      -2.0072535       0.5644875       8.6954257
0.1294296      -8.4689583       0.6567095       144.3090450
0.1941444      -9.2658037      -1.0228742       173.8027785
0.2588593      -6.5483359      -1.5767574       90.7337329
0.3235741      -0.7646807      -0.5320896       1.7357120

... and so on. There are no header lines or comments in the file.

I wrote a function that reads the file line by line. Note the two ways I tried to parse a line of data with strread().

function dest = readPowerSpectrumFile(filename, dest)

  % read enough lines to fill destination array
  [rows, cols] = size(dest);

  fid = fopen(filename, 'r');

  for line = 1 : rows
    lstr = fgetl(fid);

% this line works, but is very brittle
    [dest(line, 1), dest(line, 2), dest(line, 3), dest(line, 4)]  = strread(lstr, "%f %f %f %f");

% This line doesn't work. Or anything similar I can think of.
%    dest(line, 1:4) = strread(lstr, "%f %f %f %f");

  endfor

  fclose(fid);

endfunction

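Is there an elegant way to get the parsed values returned into an array? Otherwise, I will have to write a new function every time the number of columns changes.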

Thanks

Answer 1

If you give fprintf more values than its format specification contains, it reapplies the format until the values are used up:

>> fprintf("%d %d \n", 1:6)
1 2
3 4
5 6

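It seems the same applies to strread. If you specify only one value to read but the current line contains several, it keeps reading them and appends them to a column vector. All we need to do is assign those values to the correct row of dest: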

function dest = readPowerSpectrumFile(filename, dest)

   % read enough lines to fill destination array
   [rows, cols] = size(dest);

   fid = fopen(filename, 'r');

   for line = 1 : rows
      lstr = fgetl(fid);

      % read all values from current line into column vector 
      % and store values into row of dest
      dest(line,:) = strread(lstr, "%f");
      % this will also work since values are assumed to be numeric by default:
      % dest(line,:) = strread(lstr);
   endfor

   fclose(fid);

endfunction

Output:

readPowerSpectrumFile(filename, zeros(5,4))
ans =

   6.4715e-02  -2.0073e+00   5.6449e-01   8.6954e+00
   1.2943e-01  -8.4690e+00   6.5671e-01   1.4431e+02
   1.9414e-01  -9.2658e+00  -1.0229e+00   1.7380e+02
   2.5886e-01  -6.5483e+00  -1.5768e+00   9.0734e+01
   3.2357e-01  -7.6468e-01  -5.3209e-01   1.7357e+00
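
To see the behavior this answer relies on in isolation, here is a minimal sketch that parses a single line taken from the question's sample data. The variable names are illustrative only, and it assumes strread is available in your Octave installation.

```octave
% One data line copied from the question's sample file
lstr = "0.0647148      -2.0072535       0.5644875       8.6954257";

% A single "%f" is reused for every whitespace-separated field,
% so all values on the line come back in one vector
vals = strread(lstr, "%f");
disp(size(vals));

% Assigning to a whole row works as long as the element count
% matches the number of columns in the destination
dest = zeros(1, 4);
dest(1, :) = vals;
disp(dest);
```

Because the format no longer names each column, the same loop body works for any column count, provided dest has as many columns as each line has values.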

Answer 2

The format you describe is a matrix of floating-point values. In that case, you can simply use load, which is much faster than any of the other functions. You can see the implementation it uses in libinterp/corefcn/ls-mat-ascii.cc: read_mat_ascii_data
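
A minimal sketch of the load approach, assuming a plain whitespace-separated numeric file with no header. The temporary file name is hypothetical and exists only to make the example self-contained; in practice you would pass the path of your own data file.

```octave
% Write two sample rows from the question to a temporary file
% (hypothetical file, created only for this sketch)
fname = [tempname() ".txt"];
sample = [0.0647148 -2.0072535 0.5644875   8.6954257;
          0.1294296 -8.4689583 0.6567095 144.3090450];
fid = fopen(fname, "w");
% fprintf consumes values column by column, so transpose to emit rows
fprintf(fid, "%.7f %.7f %.7f %.7f\n", sample.');
fclose(fid);

% With an output argument, load returns the numeric text file as a matrix
data = load(fname);
disp(size(data));

delete(fname);
```

The number of columns is taken from the file itself, so nothing needs to change when the column count does.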
