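I have a function that processes binary data from a file using the file.read(len) method. However, my file is huge and has been split into many smaller files of 50 MB each. Is there a wrapper class that feeds many files into a single buffered stream and provides a read() method?
The class fileinput.FileInput can do something like this, but it only supports line-by-line reading (the readline() method, which takes no arguments) and has no read(len) that reads a specified number of bytes.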
Answer 0 (score: 4)
It is quite easy to concatenate iterables with itertools.chain:
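    from itertools import chain

    def read_by_chunks(file_objects, block_size=1024):
        # For each file, build an iterator that calls f.read(block_size)
        # until it returns '' (end of file), then chain them in order.
        readers = (iter(lambda f=f: f.read(block_size), '') for f in file_objects)
        return chain.from_iterable(readers)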
You can then do:

    for chunk in read_by_chunks([f1, f2, f3, f4], 4096):
        handle(chunk)

This processes the files in order, reading each one in chunks of 4096 bytes.
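Since the question is about binary data, here is a minimal sketch of the same idea with b'' as the sentinel. The name read_by_chunks_binary and the in-memory io.BytesIO objects are only stand-ins for illustration; real code would pass files opened with 'rb'.

```python
import io
from itertools import chain


def read_by_chunks_binary(file_objects, block_size=1024):
    """Yield successive byte chunks from each binary file object in turn."""
    # b'' is what a binary-mode read() returns at end of file, so it serves
    # as the sentinel that stops each per-file iterator.
    readers = (iter(lambda f=f: f.read(block_size), b'') for f in file_objects)
    return chain.from_iterable(readers)


# In-memory stand-ins for two of the split files.
parts = [io.BytesIO(b'abcdef'), io.BytesIO(b'ghi')]
chunks = list(read_by_chunks_binary(parts, block_size=4))
print(chunks)  # [b'abcd', b'ef', b'ghi']
```

Note that chunks never span a file boundary, so the last chunk of each file can be shorter than block_size.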
If you need an object that provides a read method, because some other function expects one, you can write a very simple wrapper:
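    class ConcatFiles(object):
        def __init__(self, files, block_size):
            self._block_size = block_size  # kept as the default size for read()
            self._reader = read_by_chunks(files, block_size)

        def __iter__(self):
            return self._reader

        def read(self):
            # Return the next chunk, or '' once every file is exhausted.
            return next(self._reader, '')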
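However, this only uses a fixed block size. You can support a block_size argument to read by doing the following:

    def read(self, block_size=None):
        # Requires __init__ to store the default as self._block_size.
        block_size = block_size or self._block_size
        total_read = 0
        chunks = []
        for chunk in self._reader:
            chunks.append(chunk)
            total_read += len(chunk)
            if total_read > block_size:
                contents = ''.join(chunks)
                # Push the surplus back in front of the remaining chunks.
                self._reader = chain([contents[block_size:]], self._reader)
                return contents[:block_size]
        return ''.join(chunks)

Note: if you are reading in binary mode, you should replace the empty strings '' in the code with empty bytes b''.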
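Putting the pieces together for binary mode, a rough sketch could look like the following. The class name ConcatBinaryFiles and the _leftover buffer are my own illustration, not part of the answer above; the surplus is kept in a bytes buffer rather than pushed back by wrapping the iterator in a new chain on every call. As in the answer, read() without an argument returns one default-sized block rather than the whole stream.

```python
import io
from itertools import chain


class ConcatBinaryFiles(object):
    """Present several binary file objects as one read(size) source."""

    def __init__(self, files, block_size=4096):
        self._block_size = block_size
        # One chunk iterator per file, stopped by the b'' end-of-file sentinel.
        self._reader = chain.from_iterable(
            iter(lambda f=f: f.read(block_size), b'') for f in files)
        self._leftover = b''  # bytes already read but not yet returned

    def read(self, size=None):
        size = self._block_size if size is None else size
        chunks = [self._leftover]
        total = len(self._leftover)
        while total < size:
            chunk = next(self._reader, b'')
            if not chunk:  # every file is exhausted
                break
            chunks.append(chunk)
            total += len(chunk)
        data = b''.join(chunks)
        # Keep whatever exceeds the request for the next call.
        self._leftover = data[size:]
        return data[:size]


parts = [io.BytesIO(b'abcdef'), io.BytesIO(b'ghij')]
src = ConcatBinaryFiles(parts, block_size=4)
first = src.read(5)    # spans two internal chunks of the first file
second = src.read(3)   # crosses the boundary into the second file
third = src.read(10)   # fewer bytes remain than requested
print(first, second, third)  # b'abcde' b'fgh' b'ij'
```

A short result from read(size) therefore means the last file has been exhausted, and further calls return b''.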
Answer 1 (score: 2)
I am not aware of anything in the standard library that does this, so, in case there isn't:
    from io import BytesIO  # binary-safe buffer, available on Python 2.6+ and 3

    class ConcatenatedFiles(object):
        def __init__(self, file_objects):
            # Reversed so the current file is always the last list element.
            self.fds = list(reversed(file_objects))

        def read(self, size=None):
            remaining = size
            data = BytesIO()
            # Test for None first: comparing None with an int fails on Python 3.
            while self.fds and (remaining is None or remaining > 0):
                data_read = self.fds[-1].read(remaining or -1)
                if remaining is None or len(data_read) < remaining:  # exhausted file
                    self.fds.pop()
                if remaining is not None:
                    remaining -= len(data_read)
                data.write(data_read)
            return data.getvalue()
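Answer 2 (score: 1)
Rather than converting the list of streams into a generator, as some of the other answers do, you can chain the streams together and then use the file interface: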
    import io

    def chain_streams(streams, buffer_size=io.DEFAULT_BUFFER_SIZE):
        """
        Chain an iterable of streams together into a single buffered stream.
        Usage:
            def generate_open_file_streams():
                for file in filenames:
                    yield open(file, 'rb')
            f = chain_streams(generate_open_file_streams())
            f.read()
        """
        class ChainStream(io.RawIOBase):
            def __init__(self):
                self.leftover = b''
                self.stream_iter = iter(streams)
                try:
                    self.stream = next(self.stream_iter)
                except StopIteration:
                    self.stream = None

            def readable(self):
                return True

            def _read_next_chunk(self, max_length):
                # Return 0 or more bytes from the current stream, first returning all
                # leftover bytes. If the stream is closed returns b''
                if self.leftover:
                    return self.leftover
                elif self.stream is not None:
                    return self.stream.read(max_length)
                else:
                    return b''

            def readinto(self, b):
                buffer_length = len(b)
                chunk = self._read_next_chunk(buffer_length)
                while len(chunk) == 0:
                    # move to next stream
                    if self.stream is not None:
                        self.stream.close()
                    try:
                        self.stream = next(self.stream_iter)
                        chunk = self._read_next_chunk(buffer_length)
                    except StopIteration:
                        # No more streams to chain together
                        self.stream = None
                        return 0  # indicate EOF
                output, self.leftover = chunk[:buffer_length], chunk[buffer_length:]
                b[:len(output)] = output
                return len(output)

        return io.BufferedReader(ChainStream(), buffer_size=buffer_size)
You can then use it like any other file or stream:

    f = chain_streams(open_files_or_chunks)
    f.read(len)
Answer 3 (score: 0)
Another approach is to use a generator:
    def read_iter(streams, block_size=1024):
        for stream in streams:
            # Keep reading until read() returns an empty result (end of file).
            while True:
                chunk = stream.read(block_size)
                if not chunk:
                    break
                yield chunk

    # open file handles
    file1 = open('f1.txt', 'r')
    file2 = open('f2.txt', 'r')
    fileOut = open('out.txt', 'w')

    # concatenate files 1 & 2
    for chunk in read_iter([file1, file2]):
        # process chunk (in this case, just concatenate to output)
        fileOut.write(chunk)

    # close files
    file1.close()
    file2.close()
    fileOut.close()
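This should not consume memory beyond what the basic script needs plus one block; each chunk is passed directly from the input file to the output file, and this repeats until all streams are exhausted.
If you need this behavior in a class, it can easily be built into a container class, as Bakuriu describes.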
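If the goal is only to concatenate the pieces into one output, as in the example above, the standard library's shutil.copyfileobj performs the same block-by-block copy. The sketch below is an alternative to the hand-written generator, not what this answer uses; the helper name concatenate and the in-memory io.BytesIO streams are placeholders for real files opened in binary mode.

```python
import io
import shutil


def concatenate(sources, destination, block_size=1024):
    """Copy each source stream, in order, into destination."""
    for source in sources:
        # copyfileobj reads and writes in blocks of block_size,
        # so the whole input is never held in memory at once.
        shutil.copyfileobj(source, destination, block_size)


parts = [io.BytesIO(b'first,'), io.BytesIO(b'second')]
out = io.BytesIO()
concatenate(parts, out)
print(out.getvalue())  # b'first,second'
```

This does not help when another function needs a read(len) method on the combined input; for that, a wrapper class like the ones in the other answers is still required.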