Question

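I'm converting a MATLAB script to numpy, but I'm having trouble reading data from a binary file. Is there an equivalent in fromfile to using fseek to skip the beginning of the file? This is the type of extraction I need to do: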

fid = fopen(fname);
fseek(fid, 8, 'bof');
second = fread(fid, 1, 'schar');
fseek(fid, 100, 'bof');
total_cycles = fread(fid, 1, 'uint32', 0, 'l');
start_cycle = fread(fid, 1, 'uint32', 0, 'l');

Thanks!

Answer 1

You can use seek on a file object in the normal way, and then pass that file object to fromfile. Here is a complete example:

import numpy as np
import os

data = np.arange(100, dtype=np.int32)  # 4-byte integers
data.tofile("temp")  # save the data

with open("temp", "rb") as f:  # reopen the file
    f.seek(256, os.SEEK_SET)  # skip 256 bytes, i.e. 64 int32 values
    x = np.fromfile(f, dtype=np.int32)  # read the rest into numpy

print(x)
# [64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88
# 89 90 91 92 93 94 95 96 97 98 99]
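
Applied to the extraction in the question, the same seek-then-fromfile pattern might look like the sketch below. The sample file, its contents, and the helper name read_fields are assumptions made up for illustration; 'schar' is taken to mean a signed 8-bit integer and the 'l' flag to mean little-endian.

```python
import os
import struct
import tempfile

import numpy as np


def read_fields(fname):
    """Hypothetical helper mirroring the fseek/fread calls in the question."""
    with open(fname, "rb") as f:
        f.seek(8, os.SEEK_SET)  # fseek(fid, 8, 'bof')
        second = np.fromfile(f, dtype="i1", count=1)[0]  # one signed char
        f.seek(100, os.SEEK_SET)  # fseek(fid, 100, 'bof')
        # two consecutive little-endian uint32 values
        total_cycles, start_cycle = np.fromfile(f, dtype="<u4", count=2)
    return int(second), int(total_cycles), int(start_cycle)


# Build a made-up 108-byte sample file so the sketch can be run as-is.
raw = bytearray(108)
struct.pack_into("<b", raw, 8, -5)
struct.pack_into("<II", raw, 100, 1000, 42)
path = os.path.join(tempfile.mkdtemp(), "sample.bin")
with open(path, "wb") as f:
    f.write(raw)

result = read_fields(path)
print(result)
# (-5, 1000, 42)
```

Passing count matters here: without it, fromfile reads everything from the current position to the end of the file.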

Answer 2

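There is probably a better answer... but when I ran into this problem, I had a file that I already wanted to access different parts of separately, which gave me an easy solution.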

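For example, say chunkyfoo.bin is a file consisting of a 6-byte header, a 1024-byte numpy array, and another 1024-byte numpy array. You can't just open the file and seek 6 bytes (because, in my experience, the first thing numpy.fromfile does is lseek back to 0). But you can just mmap the file and use fromstring instead (frombuffer in current NumPy, where binary-mode fromstring is deprecated):

import mmap
from contextlib import closing
import numpy as np

with open('chunkyfoo.bin', 'rb') as f:
    with closing(mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ)) as m:
        a1 = np.frombuffer(m[6:1030])
        a2 = np.frombuffer(m[1030:])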

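That sounds like exactly what you want to do. Except, of course, that in real life the offsets and lengths of a1 and a2 probably depend on the header, rather than being fixed constants.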

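The header is just m[:6], and you can parse it by pulling it apart explicitly, by using the struct module, or by doing whatever else you would do after reading the data. But, if you prefer, you can explicitly seek and read from f before constructing m, or after, or even make the same calls on m, and it will work without affecting a1 and a2.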
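
As a sketch of a header that determines the offsets, the example below assumes a made-up 6-byte layout: a little-endian uint16 version followed by a uint32 giving the byte length of the first array. That layout, the file name, and the sample data are illustrative assumptions, not part of any real format.

```python
import mmap
import os
import struct
import tempfile

import numpy as np

HEADER_FORMAT = "<HI"  # hypothetical: uint16 version, uint32 length of first array
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 6 bytes, no padding with "<"

# Write a sample file: header, then two float64 arrays of 1024 bytes each.
first = np.arange(128, dtype=np.float64)
second = np.linspace(0.0, 1.0, 128)
path = os.path.join(tempfile.mkdtemp(), "chunkyfoo.bin")
with open(path, "wb") as f:
    f.write(struct.pack(HEADER_FORMAT, 1, first.nbytes))
    f.write(first.tobytes())
    f.write(second.tobytes())

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as m:
        # Parse the header straight out of the mapping.
        version, first_nbytes = struct.unpack_from(HEADER_FORMAT, m, 0)
        split = HEADER_SIZE + first_nbytes
        # Slicing the mmap copies the bytes, so the arrays outlive the mapping.
        a1 = np.frombuffer(m[HEADER_SIZE:split], dtype=np.float64)
        a2 = np.frombuffer(m[split:], dtype=np.float64)

print(version, first_nbytes, a1.shape, a2.shape)
# 1 1024 (128,) (128,)
```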

Another option, which I used for a different, non-numpy-related project, is to create a wrapper file object, like this:

class SeekedFileWrapper(object):
    def __init__(self, fileobj):
        self.fileobj = fileobj
        self.offset = fileobj.tell()
    def seek(self, offset, whence=0):
        if whence == 0:
            offset += self.offset
        return self.fileobj.seek(offset, whence)
    # ... delegate everything else unchanged

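I did the "delegate everything else unchanged" part by generating a list of attributes at construction time and using it in __getattr__, but you probably want something less hacky. numpy relies on only a handful of methods of the file-like object, and I think they are properly documented, so just delegate those explicitly. But I think the mmap solution makes more sense here, unless you are trying to mechanically port a bunch of explicit seek-based code. (You would think mmap would also give you the option of leaving the data as a numpy.memmap instead of a numpy.array, which lets numpy have more control over, and feedback from, the paging and so on. But it is actually pretty tricky to get a numpy.memmap and an mmap to work together.)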
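
If a numpy.memmap is what you want, one way to sidestep combining it with a separate mmap is to let numpy.memmap open the file by name and skip the header with its own offset parameter. This is only a sketch: the 6-byte header, the file name, and the float64 payload are assumptions carried over from the chunkyfoo.bin example.

```python
import os
import tempfile

import numpy as np

HEADER_SIZE = 6  # assumed header length, as in the chunkyfoo.bin example

# Write a sample file: 6 header bytes, then 256 float64 values (2 x 1024 bytes).
payload = np.arange(256, dtype=np.float64)
path = os.path.join(tempfile.mkdtemp(), "chunkyfoo.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * HEADER_SIZE)
    f.write(payload.tobytes())

# Each memmap starts at a byte offset into the file and is paged in lazily.
a1 = np.memmap(path, dtype=np.float64, mode="r",
               offset=HEADER_SIZE, shape=(128,))
a2 = np.memmap(path, dtype=np.float64, mode="r",
               offset=HEADER_SIZE + a1.nbytes, shape=(128,))

print(a1[:3], a2[:3])
# [0. 1. 2.] [128. 129. 130.]
```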

Answer 3

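This is what I do when I have to read arbitrarily within a heterogeneous binary file. Numpy allows a bit pattern to be interpreted in arbitrary ways by changing the dtype of the array. The Matlab code in the question reads a char and two uints.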

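Read this paper (easy reading at the user level, not written for scientists) on what can be achieved by changing the dtype, strides, and dimensionality of an array.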

import numpy as np

data = np.arange(10, dtype=np.int32)  # 4-byte integers
data.tofile('f')

x = np.fromfile('f', dtype='u1')  # read the file as raw bytes
print(x.size)
# 40

second = x[8]
print('second', second)
# second 2

total_cycles = x[8:12]
print('total_cycles', total_cycles)
total_cycles = total_cycles.view('u4')  # reinterpret the same 4 bytes
print('total_cycles', total_cycles)
# total_cycles [2 0 0 0]       (little-endian byte order)
# total_cycles [2]

start_cycle = x[12:16].view('u4')
print('start_cycle', start_cycle)
# start_cycle [3]

x = x.view('u4')  # the whole buffer as uint32
print('x', x)
# x [0 1 2 3 4 5 6 7 8 9]

x[3] = 423  # views share memory, so start_cycle changes too
print('start_cycle', start_cycle)
# start_cycle [423]
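
The same idea can be pushed a step further with a structured dtype that names each field and gives its byte offset, so the three values from the question come out of a single read. This is a sketch rather than a drop-in solution: the sample file, the field values, and the 108-byte record size are assumptions for illustration.

```python
import os
import struct
import tempfile

import numpy as np

# Made-up 108-byte sample file with fields at byte offsets 8, 100 and 104.
raw = bytearray(108)
struct.pack_into("<b", raw, 8, -5)
struct.pack_into("<II", raw, 100, 1000, 42)
path = os.path.join(tempfile.mkdtemp(), "record.bin")
with open(path, "wb") as f:
    f.write(raw)

# Structured dtype: only the named fields are interpreted, the rest is padding.
layout = np.dtype({
    "names": ["second", "total_cycles", "start_cycle"],
    "formats": ["i1", "<u4", "<u4"],
    "offsets": [8, 100, 104],
    "itemsize": 108,
})

record = np.fromfile(path, dtype=layout, count=1)[0]
print(record["second"], record["total_cycles"], record["start_cycle"])
# -5 1000 42
```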

Answer 4

numpy.fromfile() has a fairly new parameter:

offset : int

The offset (in bytes) from the file's current position. Defaults to 0. Only permitted for binary files.

New in version 1.17.0.

import numpy as np

data = np.arange(100, dtype=np.int32)
data.tofile("temp")  # save the data

x = np.fromfile("temp", dtype=np.int32, offset=256)  # skip the first 256 bytes
print(x)
# [64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88
# 89 90 91 92 93 94 95 96 97 98 99]
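
Combined with count, the offset parameter can cover the extraction in the question without any explicit seek. The sketch below assumes NumPy 1.17 or later and uses a made-up sample file; the field values are illustrative only.

```python
import os
import struct
import tempfile

import numpy as np

# Made-up 108-byte sample file with fields at byte offsets 8, 100 and 104.
raw = bytearray(108)
struct.pack_into("<b", raw, 8, -5)
struct.pack_into("<II", raw, 100, 1000, 42)
path = os.path.join(tempfile.mkdtemp(), "sample.bin")
with open(path, "wb") as f:
    f.write(raw)

# offset is in bytes; count is in items of the given dtype.
second = np.fromfile(path, dtype="i1", count=1, offset=8)[0]
total_cycles, start_cycle = np.fromfile(path, dtype="<u4", count=2, offset=100)

print(second, total_cycles, start_cycle)
# -5 1000 42
```

When a file name is passed, each call opens the file anew, so the offset is measured from the start of the file.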

How to read part of a binary file with numpy?
