I can't read the whole file into memory and replace the row separator as suggested here: https://stackoverflow.com/a/29903366/1100089
So I wrote my own generator that yields the next row:

    def __init__(self):
        self.encoding = 'utf-8'
        self.row_separator = '%$%$%$'
        self.column_separator = "|||"
        self.chunk_size = 2048

    def get_row(self, fileObj):
        current_row = ''
        row_separator_length = len(self.row_separator)
        while True:
            # Read next chunk of data
            current_chunk = fileObj.read(self.chunk_size)
            if not current_chunk:
                # Yield last row
                if len(current_row.strip()) > 0:
                    yield current_row
                break
            # Check if chunk contains row separator
            row_separator_position = current_chunk.find(self.row_separator)
            if row_separator_position == -1:
                # Chunk doesn't contain a new row => Append whole chunk
                current_row += current_chunk
                continue
            while row_separator_position > -1:
                # Chunk contains a new row => Append only until row separator
                yield current_row + current_chunk[:row_separator_position]
                # Start new row
                current_row = ''
                # Remaining characters are building a new chunk
                current_chunk = current_chunk[(row_separator_position + row_separator_length):]
                # Check if new chunk contains row separator
                row_separator_position = current_chunk.find(self.row_separator)
            # Remaining characters of chunk will be appended to next row
            current_row += current_chunk

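The code works fine unless the read method cuts the row_separator in half, so that the first chunk ends with FOO%$%$ and the next one starts with %$BLA. In that case the separator is never recognized. I don't know how to proceed. Should I check only for % instead of %$%$%$ and, if % is found, append another chunk of data to check whether %$%$%$ is found? Or is there a simpler solution that I can't see?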
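To make the failure concrete, here is a minimal sketch that reproduces it. It assumes a condensed, standalone copy of the generator above (the function name rows_with_find and its parameters are hypothetical, not part of my class), io.StringIO as a stand-in for the real file, and a chunk size chosen so that the separator is cut in two.

```python
import io

def rows_with_find(file_obj, row_separator, chunk_size):
    """Hypothetical standalone copy of the find()-based generator above."""
    current_row = ''
    while True:
        current_chunk = file_obj.read(chunk_size)
        if not current_chunk:
            # End of input: yield whatever is left as the last row
            if current_row.strip():
                yield current_row
            break
        position = current_chunk.find(row_separator)
        while position > -1:
            # Only separators that lie entirely inside one chunk are found
            yield current_row + current_chunk[:position]
            current_row = ''
            current_chunk = current_chunk[position + len(row_separator):]
            position = current_chunk.find(row_separator)
        current_row += current_chunk

data = 'FOO%$%$%$BLA'
# chunk_size=7 makes the first read return 'FOO%$%$', cutting the separator.
broken = list(rows_with_find(io.StringIO(data), '%$%$%$', 7))
# chunk_size=64 reads everything at once, so the separator stays intact.
intact = list(rows_with_find(io.StringIO(data), '%$%$%$', 64))
print(broken)
print(intact)
```

With the large chunk size the two rows FOO and BLA come out as expected. With the small one, neither chunk contains a complete separator, so both halves are appended to the same row and the separator survives inside it.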
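Edit: As suggested, here is a live example of the problem. As you can see, the first algorithm (mine) does not work, while the second one (thanks to @JohanL) does: https://repl.it/JoLW/3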
Answer 0 (score: 1)
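Why use .find() instead of .split()? With .split(), the row separator is removed automatically. Combined with the iter() function to read the file in chunks, this allows a solution like the following: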

    def get_row(self, fileObj):
        buffer = ''
        row_separator_length = len(self.row_separator)
        for chunk in iter((lambda:fileObj.read(self.chunk_size)),''):
            buffer += chunk
            rows = buffer.split(self.row_separator)
            for row in rows[:-1]: # Last row might not be complete
                yield row
            buffer = rows[-1]
        rows = buffer.split(self.row_separator)
        for row in rows: # Here all rows are complete
            yield row

Once all chunks have been read, the final rows = buffer.split(self.row_separator) and the for loop that follows it are of course needed to handle the last row of the file.
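For trying this out without the surrounding class, the same idea can be sketched as a plain function. This is only an illustration under a few assumptions: the name iter_rows and its keyword parameters are made up for the example, io.StringIO stands in for the real file, and the file is opened in text mode (with a binary file, read() returns b'' at the end, which never equals the '' sentinel).

```python
import io

def iter_rows(file_obj, row_separator='%$%$%$', chunk_size=2048):
    """Yield rows from file_obj, which must be opened in text mode."""
    buffer = ''
    # iter(callable, sentinel) keeps calling read() until it returns ''.
    for chunk in iter(lambda: file_obj.read(chunk_size), ''):
        buffer += chunk
        rows = buffer.split(row_separator)
        # Everything but the last piece is a complete row.
        yield from rows[:-1]
        # The last piece may be an incomplete row or end in half a separator,
        # so it is kept and prepended to the next chunk.
        buffer = rows[-1]
    # Whatever is left after the final read is the last row.
    yield buffer

data = 'FOO%$%$%$BLA%$%$%$BAZ'
# Chunk sizes 1 and 7 both cut separators in half; 2048 reads it all at once.
for size in (1, 7, 2048):
    print(size, list(iter_rows(io.StringIO(data), chunk_size=size)))
```

All three chunk sizes give the same three rows, because a separator that is cut by a chunk boundary stays in the buffer until its second half arrives. Note that in this sketch an input ending with the separator produces one final empty row, so filter that out if it is unwanted.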