Question

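I have a formatted data file, typically billions of lines long, with several header lines of variable length. The data file takes the following form: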

    # header 1
    # header 2
    # headers are of variable length.
    # data begins from next line.
    1.23  4.56  7.89  0.12
    2.34  5.67  8.90  1.23
    :
    :
    # billions of lines of data, each row the same length, same format.
    -- end of file --

I want to extract a portion of the data from this file. My current code looks like this:

<pre>
do j=1,jmax !Suppose I want to extract jmax lines of data from the file.

  [algorithm to determine number of lines to skip, "N(j)"]
  !This determines the number of lines to skip from the previous file
  !position, when the data was read on j-1th iteration.

  !Skip N(j)-1 lines to go to the next data line to read off:
  do i=1,N(j)-1
    read(unit=unit,fmt='(A)')
  end do
  !Now read off the line of data I want:
  read(unit=unit,fmt='(data_format)') data1,data2,etc.
  !Data is stored in some arrays.
end do
</pre>

The problem is that N(j) can be anywhere between 1 and several billion, so the code takes a long time to run.

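My question is: is there a more efficient way to skip millions of lines of data? The only approach I can think of, while sticking with Fortran, is to open the file for direct access and jump straight to the desired line.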

Answer 1

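As you suggest, direct access seems like the best option. But it requires all records to have the same length, which your headers violate. Also, why use formatted output? With a file of this length, it is hard to imagine a human reading it. If you use unformatted I/O, the file will be smaller and the I/O faster. Perhaps create two files: one with the headers (metadata) in human-readable form, the other with the data in native form. A native/binary representation means losing portability, which you need to consider if you want to move the files to a different computer architecture or keep them usable for decades. Otherwise it is probably worth the convenience. Another choice would be a more sophisticated file format that combines metadata and data, such as HDF5 or FITS, but for communication between two programs belonging to one person that is probably overkill.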
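For illustration, here is a minimal sketch of one way to jump straight to a given data row when only the headers vary in length. It swaps classic direct access for Fortran 2003 stream access, which is a substitution on my part and not something prescribed above. The idea: read the headers once to find the byte position of the first data row, measure the length of one row, then compute the position of any row arithmetically. The file name `skip_demo.dat`, the row count, and the row format `(4F12.2)` are assumptions made up for the demo, and it relies on every data row having exactly the same byte length.

```fortran
program skip_lines_demo
  use iso_fortran_env, only: int64
  implicit none
  character(len=*), parameter :: fname = 'skip_demo.dat'  ! hypothetical file name
  character(len=*), parameter :: fmt_row = '(4F12.2)'     ! assumed row format
  integer, parameter :: nrows = 1000                      ! tiny stand-in for billions
  character(len=256) :: line
  character(len=:), allocatable :: buf
  integer :: u, i, ios, k
  integer(int64) :: pos0, pos1, reclen
  real :: x(4)

  ! Write a small sample file in the layout described in the question.
  open(newunit=u, file=fname, status='replace', action='write', form='formatted')
  write(u, '(A)') '# header 1'
  write(u, '(A)') '# headers are of variable length.'
  do i = 1, nrows
    write(u, fmt_row) real(i), real(i) + 0.25, real(i) + 0.5, real(i) + 0.75
  end do
  close(u)

  ! Pass 1 (formatted stream): locate the first data row and measure one row.
  open(newunit=u, file=fname, status='old', action='read', &
       access='stream', form='formatted')
  pos0 = 1
  do
    read(u, '(A)', iostat=ios) line
    if (ios /= 0) error stop 'no data rows found'
    if (line(1:1) /= '#') exit      ! first non-header line has been read
    inquire(unit=u, pos=pos0)       ! position just after this header line
  end do
  inquire(unit=u, pos=pos1)         ! position just after the first data row
  reclen = pos1 - pos0              ! bytes per row, line terminator included
  close(u)

  ! Pass 2 (unformatted stream): jump directly to row k, no skipping loop.
  open(newunit=u, file=fname, status='old', action='read', &
       access='stream', form='unformatted')
  allocate(character(len=int(reclen)) :: buf)
  k = 750
  read(u, pos=pos0 + int(k - 1, int64) * reclen) buf  ! raw bytes of row k
  read(buf, fmt_row) x                                ! parse them as numbers
  close(u)

  ! Self-check against the values written for row k.
  if (any(abs(x - [750.0, 750.25, 750.5, 750.75]) > 1.0e-3)) &
    error stop 'unexpected values for row k'
  print '(A,I0,A,4F10.2)', 'row ', k, ':', x
end program skip_lines_demo
```

The raw bytes are read through an unformatted stream and then parsed with an internal read because arbitrary `pos=` values are only guaranteed for unformatted streams. With this approach each extraction costs one seek instead of up to N(j) line reads, but it breaks as soon as a single data row differs in length from the others.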

Fortran: How to efficiently skip many lines of a data file
