Question

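I'm working with a set of ~1000 large (700 MB+) CSV files containing GPS data. The timestamps are currently in UTC, and I want to convert them to PST.

I wrote a Python script that parses each file, updates the two timestamp fields with the correct values, and writes the result to a file. Originally I wanted to minimize the number of disk writes, so for each line I appended the updated line to a string, and at the end I did one big write to the file. This works as expected with small files but hangs on large ones.

I then changed the script to write to the file as each line is processed. This works and does not hang.

Why doesn't the first solution work for large files? Is there a better way to do this than writing one line at a time?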

Building one large string:

import os
import sys

def correct(d, s):
    # Given a directory and a filename, corrects for timezone.
    # separator() and correct_time() are helpers not shown in the question.
    base = os.path.dirname(os.path.realpath(sys.argv[0]))
    with open(base + separator() + d + separator() + s) as file:
        contents = file.read().splitlines()  # loads the whole file into memory

    header = contents[0]

    corrected_contents = header + '\n'

    for line in contents[1:]:
        values = line.split(',')

        # Column 1 and the last column hold the two timestamps.
        values[1] = correct_time(values[1])
        values[-1] = correct_time(values[-1])

        corrected_line = ','.join(map(str, values)) + '\n'
        corrected_contents += corrected_line  # appends to one ever-growing string

    corrected_file = base + separator() + d + separator() + "corrected_" + s
    with open(corrected_file, 'w') as text_file:
        text_file.write(corrected_contents)  # single large write at the end
    return corrected_file

Writing each line:

import os
import sys

def correct(d, s):
    # Given a directory and a filename, corrects for timezone.
    # separator() and correct_time() are helpers not shown in the question.
    base = os.path.dirname(os.path.realpath(sys.argv[0]))
    with open(base + separator() + d + separator() + s) as file:
        contents = file.read().splitlines()  # still loads the whole file into memory

    header = contents[0]

    corrected_file = base + separator() + d + separator() + "corrected_" + s
    with open(corrected_file, 'w') as text_file:
        text_file.write(header + '\n')

        for line in contents[1:]:
            values = line.split(',')

            # Column 1 and the last column hold the two timestamps.
            values[1] = correct_time(values[1])
            values[-1] = correct_time(values[-1])

            corrected_line = ','.join(map(str, values)) + '\n'
            text_file.write(corrected_line)  # write each line as it is processed

    return corrected_file

Answer 1

I believe this line:

   corrected_contents += corrected_line

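is the culprit. IIUC (and I'm sure people will correct me if I'm wrong), for every line in the file this allocates a larger string, copies the old contents into it, and then appends the new content. As the string grows, more and more has to be copied each time, and you end up with the behavior you are observing.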
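
A minimal sketch of one common way to avoid the repeated copying, assuming it is in fact the cause: collect the corrected lines in a list and join them once at the end. The function name `build_corrected` and the `fix` callback are hypothetical stand-ins for the question's loop and its `correct_time` helper.

```python
def build_corrected(lines, fix):
    """Return the corrected file body as one string.

    `lines` is a list of CSV lines (header first, no trailing newlines);
    `fix` is a hypothetical stand-in for the question's correct_time().
    """
    parts = [lines[0]]  # the header is passed through unchanged
    for line in lines[1:]:
        values = line.split(',')
        values[1] = fix(values[1])    # sample timestamp
        values[-1] = fix(values[-1])  # system timestamp
        parts.append(','.join(map(str, values)))
    # One join at the end instead of one string concatenation per line.
    return '\n'.join(parts) + '\n'
```

This still holds the whole file in memory, so it only addresses the copying, not the memory footprint.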

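There is more information about string concatenation at "How do I append one string to another in Python?", which mentions that CPython apparently optimizes this in certain cases and turns it from quadratic to linear (so I may be wrong above: yours may be one of the optimized cases). It also mentions that PyPy does not, so it depends on how you are running your program as well. It may also be that the optimization does not apply because your string is too large (after all, it is big enough to fill a CD).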

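The linked answer also has plenty of information on ways to work around the problem (if it is indeed the problem). It is well worth a read.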
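
As for whether there is something better than building one string: below is a sketch, not a definitive implementation, of streaming the file so that neither the input nor the output is ever held in memory in full. Python file objects are buffered by default, so writing one row at a time does not mean one disk write per row. The timestamp format, the fixed UTC-8 offset for PST, and the names `correct_stream` and `correct_time` (re-implemented here purely for illustration) are all assumptions; the question does not show them.

```python
import csv
from datetime import datetime, timedelta

# Assumption: timestamps look like "2015-06-01 12:00:00"; adjust to the real format.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"
# Assumption: PST is treated as a fixed UTC-8 offset (no daylight saving).
PST_OFFSET = timedelta(hours=-8)


def correct_time(stamp):
    """Hypothetical stand-in for the question's helper: shift a UTC stamp to PST."""
    shifted = datetime.strptime(stamp, TIME_FORMAT) + PST_OFFSET
    return shifted.strftime(TIME_FORMAT)


def correct_stream(src_path, dst_path):
    """Copy src_path to dst_path row by row, fixing column 1 and the last column."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst, lineterminator="\n")
        writer.writerow(next(reader))  # header row is copied unchanged
        for row in reader:  # the reader yields one row at a time
            if not row:  # skip blank lines
                continue
            row[1] = correct_time(row[1])
            row[-1] = correct_time(row[-1])
            writer.writerow(row)
    return dst_path
```

Unlike both versions in the question, this never calls `file.read()`, so memory use stays roughly constant regardless of file size.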
