Question

Background:

A binary file is read on a Linux machine using the following Fortran code:

        parameter(nx=720, ny=360, nday=365)
c 
        dimension tmax(nx,ny,nday),nmax(nx,ny,nday)
        dimension tmin(nx,ny,nday),nmin(nx,ny,nday)
c 
        open(10,
     &file='FILE',
     &access='direct',recl=nx*ny*4)
c
        do k=1,nday
        read(10,rec=(k-1)*4+1)((tmax(i,j,k),i=1,nx),j=1,ny) 
        read(10,rec=(k-1)*4+2)((nmax(i,j,k),i=1,nx),j=1,ny) 
        read(10,rec=(k-1)*4+3)((tmin(i,j,k),i=1,nx),j=1,ny) 
        read(10,rec=(k-1)*4+4)((nmin(i,j,k),i=1,nx),j=1,ny) 
        end do

File details:

options  little_endian
title global daily analysis (grid box mean, the grid shown is the center of the grid box)
undef -999.0
xdef 720 linear    0.25 0.50
ydef 360  linear -89.75 0.50
zdef 1 linear 1 1
tdef 365 linear 01jan2015 1dy
vars 4
tmax     1  00 daily maximum temperature (C)
nmax     1  00 number of reports for maximum temperature (C)
tmin     1  00 daily minimum temperature (C)
nmin     1  00 number of reports for minimum temperature (C)
ENDVARS

Attempted solution:

I am trying to parse this into an array in Python using the following code (purposely leaving out two attributes):

with gzip.open("/FILE.gz", "rb") as infile:
     data = numpy.frombuffer(infile.read(), dtype=numpy.dtype('<f4'), count = -1)

while x <= len(data) / 4:
    tmax.append(data[(x-1)*4])
    tmin.append(data[(x-1)*4 + 2])
    x += 1

data_full = zip(tmax, tmin)

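When I test some records, the data does not line up with sample records read from the file using Fortran. I also tried dtype=numpy.float32, with no success. Judging by the number of observations, though, I do seem to be reading the whole file correctly. Before I learned that the file was created with Fortran, I was also using struct. That did not work.

There are similar questions on here, and I have tried adapting some of their answers, with no luck.

UPDATE

I am a little closer after trying the following code: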

#Define numpy variables and empty arrays
nx = 720 #number of lon
ny = 360 #number of lat
nday = 0 #iterate up to 364 (or 365 for leap year)   
tmax = numpy.empty([0], dtype='<f', order='F')
tmin = numpy.empty([0], dtype='<f', order='F')

#Parse the data into numpy arrays, shifting records as the date increments
while nday < 365:
    tmax = numpy.append(tmax, data[(nx*ny)*nday:(nx*ny)*(nday + 1)].reshape((nx,ny), order='F'))
    tmin = numpy.append(tmin, data[(nx*ny)*(nday + 2):(nx*ny)*(nday + 3)].reshape((nx,ny), order='F'))
    nday += 1

I get correct data for the first day, but for the second day I get all zeros, for the third day the max is lower than the min, and so on.

Answer 1

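While the exact format of a Fortran binary file is compiler dependent, in all cases I'm aware of, direct access files (files opened with access='direct', as in this question) have no record markers between the records. Each record has a fixed size, as given by the recl= specifier in the OPEN statement. That is, record N starts at offset (N - 1) * RECL bytes in the file.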

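One portability gotcha is that recl= is given in file storage units. For most compilers the file storage unit is an 8-bit octet (as recommended in recent versions of the Fortran standard), but for the Intel Fortran compiler recl= is in units of 32 bits. The command-line option -assume byterecl makes Intel Fortran match most other compilers.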

So in the example given here, assuming an 8-bit file storage unit, each record is 720 * 360 * 4 = 1036800 bytes.
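As a minimal sketch of the offset rule above: this assumes 8-bit file storage units, 4-byte little-endian reals, and that the whole decompressed file is already in memory as a bytes object. The helper name `read_record` is hypothetical and does not come from the question.

```python
import numpy as np

NX, NY = 720, 360        # grid size from the question
RECL = NX * NY * 4       # record length in bytes (assumes 8-bit file storage units)


def read_record(raw, n, nx=NX, ny=NY):
    """Return 1-based record n of a direct-access file as an (ny, nx) array."""
    # No record markers, so record n starts at (n - 1) * record length.
    offset = (n - 1) * nx * ny * 4
    rec = np.frombuffer(raw, dtype="<f4", count=nx * ny, offset=offset)
    # Fortran wrote i (longitude) fastest, so in C order the shape is (ny, nx).
    return rec.reshape(ny, nx)
```

With `raw` holding the decompressed file contents, tmax for day k would then be `read_record(raw, (k - 1) * 4 + 1)`, mirroring the rec= expression in the Fortran code.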

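Further, looking at the code, it seems to assume the arrays are of a 4-byte type (e.g. integer or single precision real). So if the values are single precision reals and the file was created as little endian, then the numpy dtype <f4 that you have used seems to be the correct choice.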

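Now, getting back to the Intel Fortran compiler: if the file was created by ifort without -assume byterecl, then the data you want will be in the first quarter of each record, and the rest will be padding (all zeros, or maybe even random data?). In that case you'll have to do some extra gymnastics to extract the correct data in Python and skip the padding. It should be easy to check which case applies from the size of the file: is it nx * ny * 4 * nday * 4 bytes, or nx * ny * 4 * nday * 4 * 4 bytes?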
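The size check could be sketched roughly as follows. The function names `guess_recl_unit` and `strip_padding` are hypothetical, and `strip_padding` simply follows the description above (real data in the first quarter of each record), which I have not verified against an actual ifort-written file.

```python
import numpy as np


def guess_recl_unit(file_size, nx=720, ny=360, nday=365, nvars=4):
    """Classify an uncompressed file size as 'bytes', 'words', or 'unknown'."""
    n_records = nday * nvars
    byte_recl = nx * ny * 4                  # record length if recl= counts bytes
    if file_size == n_records * byte_recl:
        return "bytes"                       # no padding expected
    if file_size == n_records * byte_recl * 4:
        return "words"                       # recl= counted 4-byte units
    return "unknown"


def strip_padding(data, nx, ny, pad_factor=4):
    """Keep only the first 1/pad_factor of every record of a flat array."""
    recs = data.reshape(-1, pad_factor * nx * ny)
    return recs[:, : nx * ny]
```

Since the file in the question is gzipped, the size to test is that of the decompressed bytes, for example `guess_recl_unit(len(infile.read()))`, not the size of the .gz file on disk.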

Answer 2

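After the update in my question, I realized I had an error in how I was looping. Of course I spotted this about 10 minutes after issuing a bounty. Ah well.

The error is in using the day to iterate through the records. That cannot work, because the day advances by one per loop iteration, which does not push the record index far enough (each day spans four records). Hence why some mins were higher than the maxes. The new code is: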

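rm = 0  # record index; starting value assumed, it is not shown in the original
while nday < 365:
    # record rm holds tmax for this day
    tmax = numpy.append(tmax, data[(nx*ny)*rm:(nx*ny)*(rm + 1)].reshape((nx,ny), order='F'))
    rm = rm + 2  # skip nmax
    # record rm now holds tmin for this day
    tmin = numpy.append(tmin, data[(nx*ny)*rm:(nx*ny)*(rm + 1)].reshape((nx,ny), order='F'))
    rm = rm + 2  # skip nmin, landing on tmax of the next day
    nday += 1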

This uses a record mover (or rm, as I call it) to move through the records by the appropriate amount. That was all it needed.
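For comparison, the same record layout can also be expressed without a loop by reshaping the flat array. This is a different technique from the loop above, sketched under the assumption that the file has no padding and that `data` is the flat '<f4' array from the question. The name `split_variables` is hypothetical.

```python
import numpy as np


def split_variables(data, nx=720, ny=360, nvars=4):
    """View a flat array as (nday, nvars, ny, nx) and return (tmax, tmin)."""
    # Records are ordered day by day, with the four variables inside each day.
    grid = data.reshape(-1, nvars, ny, nx)
    tmax = grid[:, 0]   # record (k-1)*4 + 1 in the Fortran code
    tmin = grid[:, 2]   # record (k-1)*4 + 3 in the Fortran code
    return tmax, tmin
```

The results are indexed as [day, j, i] (latitude before longitude), which is the transpose of the (nx, ny) Fortran-order grids built in the loop version.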

Reading direct access binary file format in Python
