在第一列Python 3.4.3中基于日期拆分大型csv文件

时间:2017-01-26 00:16:57

标签: python python-3.x csv split

好的,所以我在下面的链接中找到了我需要的部分答案,只要我的csv文件的第一列为2015-03-01,1,2,3,1,3格式,它就可以正常工作。当第一列更改为2015-03-01 00:00:00.000

时,如何保持此工作

How to split a huge csv file based on content of first column?

import csv
from itertools import groupby

for key, rows in groupby(csv.reader(open("largeFile.csv", "r", encoding='utf-16')),
                     lambda row: row[0]):
with open("%s.txt" % key, "w") as output:
    for row in rows:
        output.write(",".join(row) + "\n")

所以我有一个大文件,大约有170万行...

2015.01.01,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.01,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.01,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.01,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.01,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.01,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.01,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.02,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.02,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.02,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.02,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.03,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.03,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1

该程序确实为每一天创建了一个新的文本文档,这很棒!

但是当列如下时,它会停止工作。

2015-03-01 00:00:01.000,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1

2015-03-01 00:00:02.000,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1

2015-03-02 00:00:01.000,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1

2015-03-02 00:00:02.000,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1

2015-03-02 00:00:03.000,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1

2015-03-03 00:00:01.000,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1

2015-03-03 00:00:02.000,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1

它给了我以下错误。

  

Traceback(最近一次调用最后一次):文件   “C:\ Python34 \ Proj \ documents \ New folder \ dataPullSplit2.py”,第6行,in          使用open(“%s.txt”%key,“w”)作为输出:OSError:[Errno 22]无效参数:'2015-03-01 00:00:00.000.txt'

请有人指出我正确的方向。

Found Temp Solution

好的,所以通过将其从“w”更改为“a”我现在附加到文件并使用key[:-13]我能够切断文件名上的时间戳...它的工作原理......但它很慢......我怎样才能改进这一点并理解为什么它变得这么慢?

以下是代码

import csv
from itertools import groupby

for key, rows in groupby(csv.reader(open("asdf2.txt", "r", encoding='utf-16')),
                     lambda row: row[0]):

with open("%s.txt" % key[:-13], "a") as output:
    for row in rows:
        output.write(",".join(row) + "\n")

1 个答案:

答案 0 :(得分:1)

假设您的文件应保留模式2015.01.01,清除key应该有效:

key = key.split()[0].replace('-', '.')

完整代码:

import csv
from itertools import groupby


def shorten_key(key):
    return key.split()[0].replace('-', '.')


for key, rows in groupby(csv.reader(open("asdf2.txt", "r", encoding='utf-16')),
                         lambda row: shorten_key(row[0])):

    with open("%s.txt" % shorten_key(key), "a") as output:
        for row in rows:
            output.write(",".join(row) + "\n")

快速测试:

keys = ['2015-03-01 00:00:02.000',  '2015.01.01']

for key in keys:
    print(key.split()[0].replace('-', '.'))

输出:

2015.03.01
2015.01.01