Question

I can't read the file into memory and replace the row separator as suggested here: https://stackoverflow.com/a/29903366/1100089

So I wrote my own generator that yields the next row:

def __init__(self):
    self.encoding = 'utf-8'
    self.row_separator = '%$%$%$'
    self.column_separator = "|||"
    self.chunk_size = 2048

def get_row(self, fileObj):
    current_row = ''
    row_separator_length = len(self.row_separator)

    while True:
        # Read next chunk of data
        current_chunk = fileObj.read(self.chunk_size)
        if not current_chunk:
            # Yield last row
            if len(current_row.strip()) > 0:
                yield current_row

            break

        # Check if chunk contains row separator
        row_separator_position = current_chunk.find(self.row_separator)

        if row_separator_position == -1:
            # Chunk doesn't contain a new row => Append whole chunk
            current_row += current_chunk

            continue

        while row_separator_position > -1:
            # Chunk contains a new row => Append only until row separator
            yield current_row + current_chunk[:row_separator_position]

            # Start new row
            current_row = ''

            # Remaining characters are building a new chunk
            current_chunk = current_chunk[(row_separator_position + row_separator_length):]

            # Check if new chunk contains row separator
            row_separator_position = current_chunk.find(self.row_separator)

        # Remaining characters of chunk will be appended to next row
        current_row += current_chunk

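The code works fine unless the read method cuts the row_separator in two. In that case the first row is FOO%$%$ and the second row is %$BLA. I don't know how to proceed. Should I check only for % instead of %$%$%$ and, if % is found, append another chunk of data to check whether %$%$%$ is present? Or is there a simpler solution that I'm not seeing?

Edit: As suggested, here is a live example of the problem. As you can see, the first algorithm, mine, doesn't work, while the second one (thanks to @JohanL) does: https://repl.it/JoLW/3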
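
The failure can be reproduced in a few lines. This is only a minimal sketch: the io.StringIO object stands in for the real file, and the chunk size of 7 is an assumption chosen so that the read boundary falls inside the separator.

```python
import io

row_separator = '%$%$%$'
file_obj = io.StringIO('FOO' + row_separator + 'BLA')

# With a chunk size of 7 the separator is cut after its fourth character.
chunks = list(iter(lambda: file_obj.read(7), ''))
print(chunks)  # ['FOO%$%$', '%$BLA']

# find() only ever sees one chunk at a time, so neither chunk contains
# the complete separator and both lookups return -1.
positions = [chunk.find(row_separator) for chunk in chunks]
print(positions)  # [-1, -1]
```

Because each chunk is searched on its own, the separator is never detected and the two halves end up inside the row data.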

Answer 1

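Why use .find() rather than .split()? With .split(), the row separator is removed automatically. Combined with the iter() function to read the file in chunks, this allows a solution like the following: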

def get_row(self, fileObj):
    buffer = ''
    # iter(callable, sentinel) keeps calling read() until it returns ''
    # (end of file for a file opened in text mode)
    for chunk in iter(lambda: fileObj.read(self.chunk_size), ''):
        buffer += chunk
        rows = buffer.split(self.row_separator)
        for row in rows[:-1]:  # Last row might not be complete
            yield row
        # Keep the unfinished tail; it may end with part of a separator
        buffer = rows[-1]
    rows = buffer.split(self.row_separator)
    for row in rows:  # Here all rows are complete
        yield row

Once all chunks have been read, the final rows = buffer.split(self.row_separator) and the for loop that follows it are of course needed to handle the last row of the file.
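
To check that this approach copes with a separator cut in two by a chunk boundary, here is a standalone sketch of the same buffering idea. The function name iter_rows, its keyword parameters and the sample data are hypothetical, and it deliberately differs from the method above in one respect: it skips a blank final row, mirroring the strip() check in the question.

```python
import io


def iter_rows(file_obj, row_separator='%$%$%$', chunk_size=2048):
    """Yield rows from file_obj, which must be opened in text mode."""
    buffer = ''
    for chunk in iter(lambda: file_obj.read(chunk_size), ''):
        buffer += chunk
        # Everything before the last separator is complete; the tail is
        # carried over because it may end with part of a separator.
        *complete_rows, buffer = buffer.split(row_separator)
        yield from complete_rows
    # Assumption: a blank trailing row (file ends with a separator) is skipped.
    if buffer.strip():
        yield buffer


# A chunk size of 7 forces both separators to straddle chunk boundaries.
sample = io.StringIO('a|||b%$%$%$c|||d%$%$%$')
rows = [row.split('|||') for row in iter_rows(sample, chunk_size=7)]
print(rows)  # [['a', 'b'], ['c', 'd']]
```

The separator is only searched for in the accumulated buffer, never in a single chunk, so where read() happens to stop no longer matters.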
