Python CSV - Optimizing CSV Read/Write

Date: 2016-06-09 17:21:04

Tags: python algorithm csv

I was fiddling around with Python when my boss handed me a rather daunting task.

He gave me a CSV file roughly 14GB in size and asked whether I could expand that CSV into a 4TB delimited file by replicating its contents several times.

For example, take this CSV:

TIME_SK,ACCOUNT_NUMBER,ACCOUNT_TYPE_SK,ACCOUNT_STATUS_SK,CURRENCY_SK,GLACC_BUSINESS_NAME,PRODUCT_SK,PRODUCT_TERM_SK,NORMAL_BAL,SPECIAL_BAL,FINAL_MOV_YTD_BAL,NO_OF_DAYS_MTD,NO_OF_DAYS_YTD,BANK_FLAG,MEASURE_ID,SOURCE_SYSTEM_ID
20150131,F290006G93996,7,1,12,DEPOSIT INSURANCE EXPENSE,502,0,865.57767676670005,0,865.57767676670005,30,121,N,DEPOSIT INSURANCE EXPENSE,1
20150131,F2900F036FF90,7,9,12,GWM BALANCE,502,0,-139.0556,0,-139.0556,30,121,N,GWM BALANCE,1
20150131,F070007GG6790,7,1,12,DEPOSIT INSURANCE EXPENSE,1008,0,14100.016698793699,0,14100.016698793699,30,121,N,DEPOSIT INSURANCE EXPENSE,1
20150131,F2F00040FG982,7,1,12,DEPOSIT INSURANCE EXPENSE,502,0,8410.4009848750993,0,8410.4009848750993,30,121,N,DEPOSIT INSURANCE EXPENSE,1
20150131,FF30009944863,7,9,12,ACCOUNT PRINCIPAL,502,0,-2367.9400000000001,0,-2367.9400000000001,30,121,N,GL BALANCE,1
20150131,F240002FG722F,7,1,12,ACCOUNT PRINCIPAL,502,0,-28978292.390000001,0,-28978292.390000001,30,121,N,GL BALANCE,1
20150131,F0G00FFF74293,7,1,12,ACCOUNT PRINCIPAL,1008,0,-855196.81000000006,0,-855196.81000000006,30,121,N,GL BALANCE,1
20150131,FF20007947687,7,9,12,GWM BALANCE,2425,0,-368.45897600000001,0,-368.45897600000001,30,121,N,GWM BALANCE,1
20150131,F200007938744,7,1,12,GWM BALANCE,502,0,-19977.173964000001,0,-19977.173964000001,30,121,N,GWM BALANCE,1

He wanted me to scale it up by replicating the CSV's content while changing the TIME_SK field, like this:

TIME_SK,ACCOUNT_NUMBER,ACCOUNT_TYPE_SK,ACCOUNT_STATUS_SK,CURRENCY_SK,GLACC_BUSINESS_NAME,PRODUCT_SK,PRODUCT_TERM_SK,NORMAL_BAL,SPECIAL_BAL,FINAL_MOV_YTD_BAL,NO_OF_DAYS_MTD,NO_OF_DAYS_YTD,BANK_FLAG,MEASURE_ID,SOURCE_SYSTEM_ID
20150131,F290006G93996,7,1,12,DEPOSIT INSURANCE EXPENSE,502,0,865.57767676670005,0,865.57767676670005,30,121,N,DEPOSIT INSURANCE EXPENSE,1
20150131,F2900F036FF90,7,9,12,GWM BALANCE,502,0,-139.0556,0,-139.0556,30,121,N,GWM BALANCE,1
20150131,F070007GG6790,7,1,12,DEPOSIT INSURANCE EXPENSE,1008,0,14100.016698793699,0,14100.016698793699,30,121,N,DEPOSIT INSURANCE EXPENSE,1
20150131,F2F00040FG982,7,1,12,DEPOSIT INSURANCE EXPENSE,502,0,8410.4009848750993,0,8410.4009848750993,30,121,N,DEPOSIT INSURANCE EXPENSE,1
20150131,FF30009944863,7,9,12,ACCOUNT PRINCIPAL,502,0,-2367.9400000000001,0,-2367.9400000000001,30,121,N,GL BALANCE,1
20150131,F240002FG722F,7,1,12,ACCOUNT PRINCIPAL,502,0,-28978292.390000001,0,-28978292.390000001,30,121,N,GL BALANCE,1
20150131,F0G00FFF74293,7,1,12,ACCOUNT PRINCIPAL,1008,0,-855196.81000000006,0,-855196.81000000006,30,121,N,GL BALANCE,1
20150131,FF20007947687,7,9,12,GWM BALANCE,2425,0,-368.45897600000001,0,-368.45897600000001,30,121,N,GWM BALANCE,1
20150131,F200007938744,7,1,12,GWM BALANCE,502,0,-19977.173964000001,0,-19977.173964000001,30,121,N,GWM BALANCE,1
20150201,F290006G93996,7,1,12,DEPOSIT INSURANCE EXPENSE,502,0,865.57767676670005,0,865.57767676670005,30,121,N,DEPOSIT INSURANCE EXPENSE,1
20150201,F2900F036FF90,7,9,12,GWM BALANCE,502,0,-139.0556,0,-139.0556,30,121,N,GWM BALANCE,1
20150201,F070007GG6790,7,1,12,DEPOSIT INSURANCE EXPENSE,1008,0,14100.016698793699,0,14100.016698793699,30,121,N,DEPOSIT INSURANCE EXPENSE,1
20150201,F2F00040FG982,7,1,12,DEPOSIT INSURANCE EXPENSE,502,0,8410.4009848750993,0,8410.4009848750993,30,121,N,DEPOSIT INSURANCE EXPENSE,1
20150201,FF30009944863,7,9,12,ACCOUNT PRINCIPAL,502,0,-2367.9400000000001,0,-2367.9400000000001,30,121,N,GL BALANCE,1
20150201,F240002FG722F,7,1,12,ACCOUNT PRINCIPAL,502,0,-28978292.390000001,0,-28978292.390000001,30,121,N,GL BALANCE,1
20150201,F0G00FFF74293,7,1,12,ACCOUNT PRINCIPAL,1008,0,-855196.81000000006,0,-855196.81000000006,30,121,N,GL BALANCE,1
20150201,FF20007947687,7,9,12,GWM BALANCE,2425,0,-368.45897600000001,0,-368.45897600000001,30,121,N,GWM BALANCE,1
20150201,F200007938744,7,1,12,GWM BALANCE,502,0,-19977.173964000001,0,-19977.173964000001,30,121,N,GWM BALANCE,1

And so on.

I was able to get a Python script to do the job, but when it was run against the real CSV file - tens of gigabytes in size, with hundreds of millions of rows - the task proved to take far too long to finish (there was a time limit at the time; however, he has asked me to do it again now).

I used Python's built-in CSV writer. After some research, I came up with two different approaches:

1. The old and trusted iterator

This was the first version of my script; it gets the job done, but it takes far too long on a huge CSV.

. . . omitted . . .
with open('../csv/DAILY_DDMAST.csv', 'rb') as csvinput:
    with open('../result/DAILY_DDMAST_result1'+name_interval+'.csv', 'wb') as csvoutput:
        reader = csv.reader(csvinput)
        writer = csv.writer(csvoutput, lineterminator='\r\n')
# This part copies the original CSV to a new file
        for row in reader:
            writer.writerow(row)
        print("Done copying. Time elapsed: %s seconds, Total time: %s seconds" % 
              ((time.time() - start_time), (time.time() - start_time)))
        i = 0
        while i < 5:
# This part replicates the content of the CSV, modifying the TIME_SK value as it goes
            counter_time = time.time()
            for row in reader:
                newdate = datetime.datetime.strptime(row[0], "%Y%m%d") + datetime.timedelta(days=i)
                row[0] = newdate.strftime("%Y%m%d")
                writer.writerow(row)
            csvinput.seek(0)    # rewind the source file for the next pass
            next(reader, None)  # skip the header row
            print("Done processing for i = %d. Time elapsed: %s seconds, Total time: %s seconds" % 
              (i+1, (counter_time - start_time), (time.time() - start_time)))
            i += 1
. . . omitted . . . 

As I understand it, the script iterates over every row in the CSV with for row in reader and then writes each row to the new file with writer.writerow(row). I also found that iterating over the source file like this over and over is somewhat repetitive and time-consuming, so I figured there had to be a more efficient way...

2. The bucket

This was meant as an "upgrade" to the first version of the script.

. . . omitted . . .
with open('../csv/DAILY_DDMAST.csv', 'rb') as csvinput:
    with open('../result/DAILY_DDMAST_result2'+name_interval+'.csv', 'wb') as csvoutput:
        reader = csv.reader(csvinput)
        writer = csv.writer(csvoutput, lineterminator='\r\n')
        csv_buffer = list()
        for row in reader:
# Here, rather than writing each row directly, I store it in a list.
# Once the list reaches 1 million rows, it is written to the file and the "bucket" is emptied
            csv_buffer.append(row)
            if len(csv_buffer) > 1000000:
                writer.writerows(csv_buffer)
                del csv_buffer[:]
        writer.writerows(csv_buffer)
        print("Done copying. Time elapsed: %s seconds, Total time: %s seconds" % 
              ((time.time() - start_time), (time.time() - start_time)))
        i = 0
        while i < 5:
            counter_time = time.time()
            del csv_buffer[:]
            for row in reader:
                newdate = datetime.datetime.strptime(row[0], "%Y%m%d") + datetime.timedelta(days=i)
                row[0] = newdate.strftime("%Y%m%d")
# Same goes here
                csv_buffer.append(row)
                if len(csv_buffer) > 1000000:
                    writer.writerows(csv_buffer)
                    del csv_buffer[:]
            writer.writerows(csv_buffer)
            csvinput.seek(0)    # rewind the source file for the next pass
            next(reader, None)  # skip the header row
            print("Done processing for i = %d. Time elapsed: %s seconds, Total time: %s seconds" % 
                  (i+1, (counter_time - start_time), (time.time() - start_time)))            
            i += 1
. . . omitted . . . 

I figured that by storing the rows in memory and then writing them all at once with writerows, I could save time. But that wasn't the case. I found that even though I buffer the rows to be written to the new CSV, writerows still iterates over the list and writes each row to the new file, so it ends up taking almost as long as the first script...

At this point, I don't know whether I should come up with a better algorithm, or whether there is something I can use - something like writerows, only one that doesn't iterate but writes everything out in one go. I don't know if such a thing is even possible.

Either way, I need help here, and if anyone can shed some light on this I would really appreciate it!

3 Answers:

Answer 0 (score: 1)

I don't have a 14GB file to try this on, so memory footprint is a concern. Someone who knows regular expressions better than I do may have some performance-tuning suggestions.

The main concept is to avoid iterating over every line when it can be avoided. Let re work its magic on the whole body of text, then write that body out to the file.

import re

newdate = "20150201,"
f = open('sample.csv', 'r')
g = open('result.csv', 'w')

body = f.read()
## keeps the original csv
g.write(body)  
# strip off the header -- we already have one.
header, mainbody = body.split('\n', 1)
# replace all the dates
newbody = re.sub(r"20150131,", newdate, mainbody)
#end of the body didn't have a newline. Adding one back in.
g.write('\n' + newbody)

f.close()
g.close()

Answer 1 (score: 1)

Writing rows in batches is not an improvement, because your write I/O will still be the same size. Batching writes only gives you an improvement if you can increase the I/O size, which reduces the number of system calls and allows the I/O system to deal with fewer but larger writes.

Honestly, I wouldn't complicate the code with batched writes for maintainability reasons, but I can certainly understand the attempt to squeeze out more speed, if only for educational purposes.

What you want is to batch your write I/O - batching csv rows does not accomplish that.

[Removed an example that used StringIO.. this is a better way.]

Python's write() uses buffered I/O. By default it buffers at 4k (on Linux). If you open the file with the buffering parameter, you can make it larger:

with open("/tmp/x", "w", 1024*1024) as fd:
    for i in range(0, 1000000):
        fd.write("line %d\n" %i)

Then your writes will be 1MB. strace output:

write(3, "line 0\nline 1\nline 2\nline 3\nline"..., 1048576) = 1048576
write(3, "ine 96335\nline 96336\nline 96337\n"..., 1048576) = 1048576
write(3, "1\nline 184022\nline 184023\nline 1"..., 1048576) = 1048576
write(3, "ne 271403\nline 271404\nline 27140"..., 1048576) = 1048576
write(3, "58784\nline 358785\nline 358786\nli"..., 1048576) = 1048576
write(3, "5\nline 446166\nline 446167\nline 4"..., 1048576) = 1048576
write(3, "ne 533547\nline 533548\nline 53354"..., 1048576) = 1048576
[...]

Your simpler original code will work; you just need to change the block size of the open() calls (I would change it for both source and destination).
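
Applied to the question's script, that change would only touch the open() calls. A minimal sketch (file names are placeholders, and only the third, buffering, argument is new compared with the original code):

import csv

# Same structure as the question's copy loop, with a 1 MB buffer on both files.
with open('input.csv', 'rb', 1024*1024) as csvinput:
    with open('output.csv', 'wb', 1024*1024) as csvoutput:
        reader = csv.reader(csvinput)
        writer = csv.writer(csvoutput, lineterminator='\r\n')
        for row in reader:
            writer.writerow(row)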

My other suggestion is to ditch csv, but that may carry some risk. If you have quoted strings with commas in them, you have to build the right kind of parser.

But - since the field you want to modify is fairly regular and is the first field, you may find it simpler to just have a readline/write loop in which you only replace the first field and ignore the rest.

#!/usr/bin/python
import datetime
import re

with open("/tmp/out", "w", 1024*1024) as fdout, open("/tmp/in", "r", 1024*1024) as fdin:
    for i in range(0, 6):
        fdin.seek(0)
        for line in fdin:
            if i == 0:
                fdout.write(line)
                continue
            match = re.search(r"^(\d{8}),", line)
            if match:
                date = datetime.datetime.strptime(match.group(1), "%Y%m%d")
                fdout.write(re.sub(r"^\d{8},", (date + datetime.timedelta(days=i)).strftime("%Y%m%d,"), line))
            else:
                if line.startswith("TIME_SK,"):
                    continue
                raise Exception("Could not find /^\d{8},/ in '%s'" % line)

If order doesn't matter, don't reread the file over and over:

#!/usr/bin/python
import datetime
import re

with open("/tmp/in", "r", 1024*1024) as fd, open("/tmp/out", "w", 1024*1024) as out:
    for line in fd:
        match = re.search(r"^(\d{8}),", line)
        if match:
            out.write(line)
            date = datetime.datetime.strptime(match.group(1), "%Y%m%d")
            for days in range(1, 6):
                out.write(re.sub(r"^\d{8},", (date + datetime.timedelta(days=days)).strftime("%Y%m%d,"), line))
        else:
            if line.startswith("TIME_SK,"):
                out.write(line)
                continue
            raise Exception("Could not find /^\d{8},/ in %s" % line)

I went ahead and profiled one of these with python -mcProfile and was surprised at how much time was spent in strptime. Also try this memoized strptime():

# Cache strptime() calls
_STRPTIME = {}

def strptime(s):
    if s not in _STRPTIME:
        _STRPTIME[s] = datetime.datetime.strptime(s, "%Y%m%d")
    return _STRPTIME[s]
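
(For what it's worth, on Python 3 the standard library's functools.lru_cache gives the same memoization with less code; this is my addition, not part of the original answer.)

import datetime
import functools

@functools.lru_cache(maxsize=None)
def strptime(s):
    # Parse each distinct date string only once; repeated calls hit the cache.
    return datetime.datetime.strptime(s, "%Y%m%d")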

Answer 2 (score: 1)

First, you're going to be limited by the write speed. Typical write speed for a desktop machine is on the order of 40 seconds per gigabyte. You need to write 4,000 gigabytes, so it will take on the order of 160,000 seconds (44.5 hours) just to write the output. The only way to reduce that time is to get a faster drive.

To make a 4 TB file by replicating a 14 GB file, you have to copy the original 286 times (really 285.71). The simplest way to do that is:

open output file
starting_date = date on first transaction
for pass = 1 to 286
    open original file
    while not end of file
        read transaction
        replace date
        write to output
        increment date
    end while
end for
close output file
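
A rough Python sketch of that pseudocode (my reading of it, not the answer author's code; file names are placeholders, and each pass shifts TIME_SK forward by the pass number):

import csv
import datetime

with open('output.csv', 'wb') as outfile:
    writer = csv.writer(outfile, lineterminator='\r\n')
    for pass_no in range(286):
        with open('original.csv', 'rb') as infile:
            reader = csv.reader(infile)
            header = next(reader)
            if pass_no == 0:
                writer.writerow(header)  # write the header only once
            for row in reader:
                date = datetime.datetime.strptime(row[0], "%Y%m%d")
                row[0] = (date + datetime.timedelta(days=pass_no)).strftime("%Y%m%d")
                writer.writerow(row)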

However, at a typical read speed of around 20 seconds per gigabyte, that's 80,000 seconds (22 hours 15 minutes) just for reading.

You can't do anything about the write time, but you can probably cut the read time by a lot.

If you can buffer the entire 14 GB input file in memory, then the read time drops to about five minutes.

If you don't have the memory to hold 14 GB, consider reading it into a compressed in-memory stream. That CSV should compress quite well - to less than half its current size. Then, instead of opening the input file on every pass through the loop, you just re-initialize a stream reader from the compressed copy of the file that you hold in memory.

In C#, I would just use the MemoryStream and GZipStream classes. A quick Google search indicates that similar capabilities exist in Python, but since I'm not a Python programmer I can't tell you exactly how to use them.
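
In Python, the closest standard-library equivalents would be io.BytesIO together with the gzip module. A minimal sketch of the idea (my assumption of how it would look; the answer itself gives no Python code):

import gzip
import io
import shutil

# Read and compress the source once, keeping the compressed copy in memory
# (assumption: the compressed CSV fits comfortably in RAM).
compressed = io.BytesIO()
with open('original.csv', 'rb') as f:
    with gzip.GzipFile(fileobj=compressed, mode='wb') as gz:
        shutil.copyfileobj(f, gz)

# On each pass, rewind the in-memory buffer and stream the decompressed
# lines out of it instead of rereading the file from disk.
for pass_no in range(286):
    compressed.seek(0)
    with gzip.GzipFile(fileobj=compressed, mode='rb') as gz:
        for line in gz:
            pass  # replace the date here and write the line to the output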