Error when reading a csv file with a non-standard row separator

Time: 2017-07-29 08:50:17

Tags: python python-3.x csv

I cannot read the whole file into memory and replace the row separator as suggested here: https://stackoverflow.com/a/29903366/1100089

So I wrote my own generator that yields the next row:

def __init__(self):
    self.encoding = 'utf-8'
    self.row_separator = '%$%$%$'
    self.column_separator = "|||"
    self.chunk_size = 2048

def get_row(self, fileObj):
    current_row = ''
    row_separator_length = len(self.row_separator)

    while True:
        # Read next chunk of data
        current_chunk = fileObj.read(self.chunk_size)
        if not current_chunk:
            # Yield last row
            if len(current_row.strip()) > 0:
                yield current_row

            break

        # Check if chunk contains row separator
        row_separator_position = current_chunk.find(self.row_separator)

        if row_separator_position == -1:
            # Chunk doesn't contain a new row => Append whole chunk
            current_row += current_chunk

            continue

        while row_separator_position > -1:
            # Chunk contains a new row => Append only until row separator
            yield current_row + current_chunk[:row_separator_position]

            # Start new row
            current_row = ''

            # Remaining characters are building a new chunk
            current_chunk = current_chunk[(row_separator_position + row_separator_length):]

            # Check if new chunk contains row separator
            row_separator_position = current_chunk.find(self.row_separator)

        # Remaining characters of chunk will be appended to next row
        current_row += current_chunk

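The code works fine unless the read method cuts the row_separator in half, so that the first row is FOO%$%$ and the second row is %$BLA. I don't know how to proceed. Should I check only for % instead of %$%$%$ and, if a % is found, append another chunk of data to check whether %$%$%$ is present? Or is there a simpler solution that I'm not seeing?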
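A minimal sketch that reproduces the failure, assuming text-mode input and a condensed version of the generator above. The names naive_rows and ROW_SEPARATOR are made up for this illustration. With a chunk size of 7, the separator is split across two reads, is never found, and both rows come back merged into one:

```python
import io

ROW_SEPARATOR = '%$%$%$'  # same separator as in the question


def naive_rows(file_obj, chunk_size):
    """Search each chunk on its own, like the generator above (condensed)."""
    current_row = ''
    while True:
        chunk = file_obj.read(chunk_size)
        if not chunk:
            # End of input: emit whatever is left, unless it is blank
            if current_row.strip():
                yield current_row
            break
        position = chunk.find(ROW_SEPARATOR)
        while position > -1:
            yield current_row + chunk[:position]
            current_row = ''
            chunk = chunk[position + len(ROW_SEPARATOR):]
            position = chunk.find(ROW_SEPARATOR)
        # No (further) separator in this chunk: carry the text over
        current_row += chunk


data = 'FOO%$%$%$BLA'

# chunk_size=7 reads 'FOO%$%$' and then '%$BLA': the separator is cut in half
broken = list(naive_rows(io.StringIO(data), 7))

# chunk_size=64 reads everything at once, so the separator stays intact
intact = list(naive_rows(io.StringIO(data), 64))

print(broken)
print(intact)
```

Only the chunk size differs between the two calls, which shows that the result depends on where the reads happen to fall.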

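Edit: As suggested, here is a working example of the problem. As you can see, the first algorithm (mine) does not work, while the second one (thanks to @JohanL) does: https://repl.it/JoLW/3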

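1 answer:

Answer 0: (score: 1)

Why use .find() rather than .split()? With .split(), the row separators are removed automatically. Combined with the iter() function to read the file in chunks, this allows a solution like the following: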

def get_row(self, fileObj):
    buffer = ''
    # iter(callable, sentinel) keeps calling read() until it returns ''
    # (end of a file opened in text mode)
    for chunk in iter(lambda: fileObj.read(self.chunk_size), ''):
        buffer += chunk
        rows = buffer.split(self.row_separator)
        for row in rows[:-1]:  # Last row might not be complete
            yield row
        buffer = rows[-1]  # Keep the incomplete tail for the next chunk
    rows = buffer.split(self.row_separator)
    for row in rows:  # Here all rows are complete
        yield row

Once all chunks have been read, the final rows = buffer.split(self.row_separator) and the for loop after it are of course needed to handle the last row of the file.
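For trying this out without a real file, here is a self-contained sketch of the same buffer-and-split idea. The RowReader class and its constructor parameters are hypothetical; only the attribute names and separators come from the question. It also assumes text-mode input (for a binary file the sentinel and buffer would have to be b''). Unlike the code above, it skips a blank trailing row, as the generator in the question does.

```python
import io


class RowReader:
    """Hypothetical wrapper class; attribute names follow the question."""

    def __init__(self, row_separator='%$%$%$', column_separator='|||',
                 chunk_size=2048):
        self.row_separator = row_separator
        self.column_separator = column_separator
        self.chunk_size = chunk_size

    def get_row(self, file_obj):
        buffer = ''
        # read() returns '' at the end of a text-mode file, which stops iter()
        for chunk in iter(lambda: file_obj.read(self.chunk_size), ''):
            buffer += chunk
            # Everything before the last separator is complete; the final
            # piece may be cut mid-row or mid-separator, so it stays buffered
            *complete, buffer = buffer.split(self.row_separator)
            yield from complete
        # Whatever is left is the last row; skip it if it is blank
        if buffer.strip():
            yield buffer


# A deliberately small chunk size forces separators to straddle chunk borders
reader = RowReader(chunk_size=7)
data = 'a|||b%$%$%$c|||d%$%$%$'
rows = [row.split(reader.column_separator)
        for row in reader.get_row(io.StringIO(data))]
print(rows)
```

Because a partial separator at the end of a chunk is never a match for split(), it simply stays in the buffer until the next read completes it, so no special case is needed for a separator cut in half.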