Question

I wrote code that reads a large (> 15 GB) text file and converts its data to CSV one line at a time.

txt_file = fileName+".txt"
csv_file = fileName+".csv"
with open(txt_file, "r") as tf, open(csv_file, "w") as cf:
    for line in tf:
        cf.writelines(" ".join(line.split()).replace(' ', ','))
        cf.write("\n")

Edit:
As for the data,
data in the input file:
abc def ghi jkl

Expected data in the output file:
abc,def,ghi,jkl

I am using Python 2.7.6 on Mac OS X 10.10.3.

Answer 1

Leave the parsing and CSV formatting to the csv module:

import csv

txt_file = fileName + ".txt"
csv_file = fileName + ".csv"
with open(txt_file, "rb") as tf, open(csv_file, "wb") as cf:
    reader = csv.reader(tf, delimiter=' ')
    writer = csv.writer(cf)
    writer.writerows(reader)

Or, if you have odd quoting, treat the input file as text and split manually:

import csv

txt_file = fileName + ".txt"
csv_file = fileName + ".csv"
with open(txt_file, "rb") as tf, open(csv_file, "wb") as cf:
    writer = csv.writer(cf)
    writer.writerows(line.split() for line in tf)

File streams use buffers to read and write data in chunks.
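
To make the buffering point concrete, here is a minimal sketch that passes an explicit buffer size as the third argument to open(). The function name convert_with_buffer and the 1 MiB value are assumptions for illustration, not part of the answer above, and a larger buffer is not guaranteed to be faster, so it is worth timing on the real file. Unlike the csv module, this plain join does no quoting.

```python
BUFFER_SIZE = 1024 * 1024  # 1 MiB; an assumed value, tune by measuring


def convert_with_buffer(txt_file, csv_file, buffer_size=BUFFER_SIZE):
    """Rewrite whitespace-separated fields as comma-separated fields."""
    # The third positional argument to open() is the buffer size in bytes.
    with open(txt_file, "r", buffer_size) as tf, \
            open(csv_file, "w", buffer_size) as cf:
        for line in tf:
            # split() with no argument collapses runs of whitespace.
            cf.write(",".join(line.split()) + "\n")
```

The loop still handles one line at a time, but the reads and writes that reach the disk happen in buffer-sized chunks.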

Answer 2

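I know this technically doesn't answer your question, but if you are able to process the file before your Python script runs, I believe sed would be the fastest way. Given how large your file is, I think a non-Python suggestion is worth making.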

How to replace space with comma using sed

You can do this from the command line before starting your Python script, or even call it from within the script using subprocess.
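
As a rough sketch of the subprocess route, assuming sed is on the PATH: the function name spaces_to_commas_with_sed and the exact sed expression are illustrative choices, not taken from the linked answer. Note that this expression only handles spaces (not tabs), and a line with leading or trailing spaces would get a leading or trailing comma.

```python
import subprocess


def spaces_to_commas_with_sed(txt_file, csv_file):
    """Run sed as a child process and write its output to csv_file."""
    # "s/  */,/g" (two spaces before the *) replaces every run of one or
    # more spaces with a single comma, using basic regex syntax.
    with open(csv_file, "wb") as out:
        subprocess.check_call(["sed", "s/  */,/g", txt_file], stdout=out)
```

Because sed writes straight to the output file, none of the data passes through the Python process.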

Answer 3

The simplest way is to do it like this.

with open("file.json", "r") as r, open("write.csv", "a") as w:
    lines = []
    for l in r:
        #Process l
        lines.append(l)
        if len(lines) >= 1000000: #Only uses 54mb of RAM so I hear
            w.writelines(lines)  # write out a full batch
            del lines[:]
    w.writelines(lines)  # write whatever is left after the loop
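
The same batching idea can also be sketched with itertools.islice, which pulls a fixed number of lines from the file per step. The function name convert_in_batches and the default batch size are assumptions for illustration. Since file objects already buffer their I/O, this may not beat a plain line-by-line loop, so measure before adopting it.

```python
from itertools import islice


def convert_in_batches(txt_file, csv_file, batch_size=100000):
    """Convert whitespace-separated lines to CSV, batch_size lines at a time."""
    with open(txt_file, "r") as tf, open(csv_file, "w") as cf:
        while True:
            # islice takes up to batch_size lines from the file iterator.
            batch = list(islice(tf, batch_size))
            if not batch:
                break  # end of file
            cf.writelines(",".join(line.split()) + "\n" for line in batch)
```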

How to optimize Python code to read multiple lines at a time instead of one line?
