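OK, so I found part of the answer I need in the link below. It works fine as long as the first column of my csv file is in the format 2015-03-01,1,2,3,1,3. It stops working when the first column changes to 2015-03-01 00:00:00.000.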
How to split a huge csv file based on content of first column?
import csv
from itertools import groupby

for key, rows in groupby(csv.reader(open("largeFile.csv", "r", encoding='utf-16')),
                         lambda row: row[0]):
    with open("%s.txt" % key, "w") as output:
        for row in rows:
            output.write(",".join(row) + "\n")
So I have a large file, about 1.7 million lines...
2015.01.01,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.01,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.01,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.01,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.01,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.01,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.01,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.02,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.02,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.02,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.02,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.03,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.03,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
The program does create a new text document for each day, which is great!
But it stops working when the first column looks like this:
2015-03-01 00:00:01.000,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015-03-01 00:00:02.000,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015-03-02 00:00:01.000,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015-03-02 00:00:02.000,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015-03-02 00:00:03.000,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015-03-03 00:00:01.000,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015-03-03 00:00:02.000,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
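It gives me the following error:
Traceback (most recent call last):
  File "C:\Python34\Proj\documents\New folder\dataPullSplit2.py", line 6, in <module>
    with open("%s.txt" % key, "w") as output:
OSError: [Errno 22] Invalid argument: '2015-03-01 00:00:00.000.txt'
Could someone please point me in the right direction?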
Found Temp Solution
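OK, so by changing "w" to "a" I now append to the file, and by using key[:-13] I was able to cut the timestamp off the file name. It works, but it is slow. How can I improve this, and why did it become so slow?
Here is the code: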
import csv
from itertools import groupby

for key, rows in groupby(csv.reader(open("asdf2.txt", "r", encoding='utf-16')),
                         lambda row: row[0]):
    with open("%s.txt" % key[:-13], "a") as output:
        for row in rows:
            output.write(",".join(row) + "\n")
Answer 0 (score: 1)
Assuming your files should keep the pattern 2015.01.01, cleaning up key should work:
key = key.split()[0].replace('-', '.')
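As a side note on why the original file name was rejected: Windows does not allow a colon (among a few other characters) in file names, and the time part of the key contains colons. The sketch below only illustrates that point; `WINDOWS_RESERVED` and `bad_chars` are hypothetical names introduced here, not part of the question's code.

```python
# Characters that Windows does not allow in file names.
WINDOWS_RESERVED = set('<>:"/\\|?*')

def bad_chars(name):
    """Return the reserved characters found in a candidate file name (hypothetical helper)."""
    return sorted(set(name) & WINDOWS_RESERVED)

def shorten_key(key):
    # Keep only the date part and use dots, e.g. "2015.03.01".
    return key.split()[0].replace('-', '.')

raw_key = "2015-03-01 00:00:00.000"
print(bad_chars("%s.txt" % raw_key))               # [':']
print(bad_chars("%s.txt" % shorten_key(raw_key)))  # []
```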
Full code:
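import csv
from itertools import groupby

def shorten_key(key):
    # "2015-03-01 00:00:02.000" -> "2015.03.01"
    return key.split()[0].replace('-', '.')

# Group on the shortened key so that all rows of one day form a single group.
for key, rows in groupby(csv.reader(open("asdf2.txt", "r", encoding='utf-16')),
                         lambda row: shorten_key(row[0])):
    # key is already shortened by the groupby key function.
    with open("%s.txt" % key, "a") as output:
        for row in rows:
            output.write(",".join(row) + "\n")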
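A likely reason the temporary solution in the question was slow: there, groupby still keyed on the full timestamp, so every run of rows with a distinct timestamp became its own group and reopened the day's file in append mode. Grouping on the shortened key opens each file once per run of consecutive rows for that day. The small sketch below counts the groups for the sample rows from the question; `count_groups` is a hypothetical helper used only for illustration.

```python
from itertools import groupby

# Sample rows from the question, truncated to two columns for brevity.
rows = [
    ["2015-03-01 00:00:01.000", "NULL"],
    ["2015-03-01 00:00:02.000", "NULL"],
    ["2015-03-02 00:00:01.000", "NULL"],
    ["2015-03-02 00:00:02.000", "NULL"],
    ["2015-03-02 00:00:03.000", "NULL"],
    ["2015-03-03 00:00:01.000", "NULL"],
    ["2015-03-03 00:00:02.000", "NULL"],
]

def shorten_key(key):
    return key.split()[0].replace('-', '.')

def count_groups(rows, keyfunc):
    """Number of groups groupby produces, i.e. how many times a file would be opened."""
    return sum(1 for _ in groupby(rows, keyfunc))

per_timestamp = count_groups(rows, lambda row: row[0])
per_day = count_groups(rows, lambda row: shorten_key(row[0]))
print(per_timestamp, per_day)  # 7 3
```

With 1.7 million rows and mostly distinct timestamps, the first variant would open a file roughly once per row, while the second opens one per day.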
Quick test:
keys = ['2015-03-01 00:00:02.000', '2015.01.01']
for key in keys:
    print(key.split()[0].replace('-', '.'))
Output:
2015.03.01
2015.01.01
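For completeness, here is a minimal end-to-end sketch of the same idea wrapped in a function and run against a small temporary file. It is a variation, not the code above: `split_by_day` and its parameters are hypothetical names, it uses `csv.writer` instead of `",".join(...)` so quoted fields survive, and it assumes every row has a non-empty first column.

```python
import csv
import os
import tempfile
from itertools import groupby

def shorten_key(key):
    # "2015-03-01 00:00:02.000" -> "2015.03.01"
    return key.split()[0].replace('-', '.')

def split_by_day(in_path, out_dir, encoding="utf-16"):
    """Write one <day>.txt per run of consecutive rows sharing a day; return the file names."""
    written = []
    with open(in_path, "r", encoding=encoding, newline="") as src:
        for day, rows in groupby(csv.reader(src), lambda row: shorten_key(row[0])):
            name = "%s.txt" % day
            # Append mode, as in the answer, so a day that reappears later is not overwritten.
            with open(os.path.join(out_dir, name), "a", newline="") as out:
                csv.writer(out).writerows(rows)
            written.append(name)
    return written

with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.csv")
    with open(sample, "w", encoding="utf-16", newline="") as f:
        f.write("2015-03-01 00:00:01.000,NULL,0\n"
                "2015-03-01 00:00:02.000,NULL,1\n"
                "2015-03-02 00:00:01.000,NULL,0\n")
    files = split_by_day(sample, tmp)
    with open(os.path.join(tmp, files[0]), newline="") as f:
        day1 = list(csv.reader(f))

print(files)  # ['2015.03.01.txt', '2015.03.02.txt']
```

Because the output files are opened in append mode, rerunning the script adds to existing files; delete old output first if that is not wanted.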