I am trying to read a binary file with Python. Someone has read this data with R using the following code:
x <- readBin(webpage, numeric(), n=6e8, size = 4, endian = "little")
myPoints <- data.frame("tmax" = x[1:(length(x)/4)],
                       "nmax" = x[(length(x)/4 + 1):(2*(length(x)/4))],
                       "tmin" = x[(2*length(x)/4 + 1):(3*(length(x)/4))],
                       "nmin" = x[(3*length(x)/4 + 1):(length(x))])
With Python, I am trying the following code:
import struct
with open('file','rb') as f:
    val = f.read(16)
    while val != '':
        print(struct.unpack('4f', val))
        val = f.read(16)
The results I get are slightly different. For example, the first row in R returns 4 columns: -999.9, 0, -999.0, 0. In Python, all four columns come back as -999.0 (image below).

I know they are using some [] code to slice by the length of the file, but I do not know how to do that exactly in Python, or why they are doing it. Basically, I want to recreate in Python what the R code is doing.

I can provide more of either codebase if needed; I just do not want to flood the post with unnecessary code.
Answer 0 (score: 1)
According to the R code, the binary file first contains a certain number of tmax values, then the same number of nmax values, then the tmin values and finally the nmin values. What the code does is read the whole file and then use slicing to cut it into 4 parts (tmax, nmax, and so on).
To do the same in Python:

import struct

# Read entire file into memory first. This is done so we can count
# number of bytes before parsing the bytes. It is not a very memory
# efficient way, but it's the easiest. The R-code as posted wastes even
# more memory: it always takes 6e8 * 4 bytes (~ 2.2Gb) of memory no
# matter how small the file may be.
#
data = open('data.bin','rb').read()
# Calculate number of points in the file. This is
# file-size / 16, because there are 4 numeric()'s per
# point, and they are 4 bytes each.
#
num = int(len(data) / 16)
# Now we know how many there are, we take all tmax numbers first, then
# all nmax's, tmin's and lastly all nmin's.
# First generate a format string, because it depends on the number of points
# there are in the file. It will look like: "fffff"
#
format_string = 'f' * num
# Then, for cleaner code, calculate chunk size of the bytes we need to
# slice off each time.
#
n = num * 4 # 4-byte floats
# Note that python has different interpretation of slicing indices
# than R, so no "+1" is needed here as it is in the R code.
#
tmax = struct.unpack(format_string, data[:n])
nmax = struct.unpack(format_string, data[n:2*n])
tmin = struct.unpack(format_string, data[2*n:3*n])
nmin = struct.unpack(format_string, data[3*n:])
print("tmax", tmax)
print("nmax", nmax)
print("tmin", tmin)
print("nmin", nmin)
If the goal is to structure these data as a list of points(?), e.g. (tmax, nmax, tmin, nmin), then append something like the following to the code:
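A minimal sketch, assuming the tmax, nmax, tmin and nmin tuples produced by the code above; zip() pairs up the i-th element of each:

points = list(zip(tmax, nmax, tmin, nmin))   # each element is one (tmax, nmax, tmin, nmin) point
print(points[0])                             # first point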
Answer 1 (score: 0)
Here is a way to reduce memory consumption. It may also be a bit faster (but that is hard for me to check).

My computer does not have enough memory to run the first program with these large files. This one does work, but I still needed to first create a list of only the tmax values (the first 1/4 of the file), print it, and then delete that list so there is enough memory left for nmax, tmin and nmin.

But this one also says that nmin in the 2018 file is all -999.0. If that does not make sense, could you check what the R code gives for it? I suspect that is simply what is in the file. Of course, the other possibility is that I got it wrong (which I doubt). However, I also tried the 2017 file, and it does not have this problem: for tmax, nmax, tmin and nmin alike, about 37% of the values are -999.0.
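If you want to check that fraction yourself, counting the sentinel value is enough. A quick sketch, assuming the tmax tuple from the first answer's code (any of the four would do):

missing = sum(1 for v in tmax if v == -999.0)
print("fraction of -999.0 in tmax:", missing / len(tmax))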
Anyway, here is the second version of the code:
import os
import struct
# load_data()
# data_store : object to append() data items (floats) to
# num : number of floats to read and store
# datafile : opened binary file object to read float data from
#
def load_data(data_store, num, datafile):
    for i in range(num):
        data = datafile.read(4)              # process one float (=4 bytes) at a time
        item = struct.unpack("<f", data)[0]  # '<' means little endian
        data_store.append(item)
# save_list() saves a list of floats as strings to a file
#
def save_list(filename, datalist):
    output = open(filename, "wt")
    for item in datalist:
        output.write(str(item) + '\n')
    output.close()
#### MAIN ####
datafile = open('data.bin','rb')
# Get file size so we can calculate number of points without reading
# the (large) file entirely into memory.
#
file_info = os.stat(datafile.fileno())
# Calculate number of points, i.e. number of each tmax's, nmax's,
# tmin's, nmin's. A point is 4 floats of 4 bytes each, hence number
# of points = file-size / (4*4)
#
num = int(file_info.st_size / 16)
tmax_list = list()
load_data(tmax_list, num, datafile)
save_list("tmax.txt", tmax_list)
del tmax_list # huge list, save memory
nmax_list = list()
load_data(nmax_list, num, datafile)
save_list("nmax.txt", nmax_list)
del nmax_list # huge list, save memory
tmin_list = list()
load_data(tmin_list, num, datafile)
save_list("tmin.txt", tmin_list)
del tmin_list # huge list, save memory
nmin_list = list()
load_data(nmin_list, num, datafile)
save_list("nmin.txt", nmin_list)
del nmin_list # huge list, save memory
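
If the four text files later need to be recombined into points (as the R data.frame does), that can also be done line by line so memory use stays low. A sketch, assuming the file names written above:

# Read the four column files in parallel, one line per point.
with open("tmax.txt") as f1, open("nmax.txt") as f2, \
     open("tmin.txt") as f3, open("nmin.txt") as f4:
    for tmax, nmax, tmin, nmin in zip(f1, f2, f3, f4):
        point = (float(tmax), float(nmax), float(tmin), float(nmin))
        print(point)   # or process/store the point as needed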