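I have a csv file with only three columns but more than 200K rows. I want to split it into multiple csv files based on the second column (the time column), so that each file has the same columns but fewer rows (according to my specification). I want the duration to be variable, so that I can put 10 seconds of readings into each file, or 15 seconds, or 19 seconds. I have tried several pieces of code to split the csv file, but without success, as I am very new to Python.
The input csv file looks like this: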
Col 0 Col 1 Col 2 Col 3
Data YYY 12:40:05 Data XXX
Data YYY 12:40:06 Data XXX
Data YYY 12:40:07 Data XXX
Data YYY 12:40:08 Data XXX
Data YYY 12:40:09 Data XXX
Data YYY 12:40:10 Data XXX
Data YYY 12:40:11 Data XXX
Data YYY 12:40:12 Data XXX
Data YYY 12:40:13 Data XXX
The output csv files I want are:
File 1
Col 0 Col 1 Col 2 Col 3
Data YYY 12:40:05 Data XXX
Data YYY 12:40:06 Data XXX
Data YYY 12:40:07 Data XXX
File 2
Col 0 Col 1 Col 2 Col 3
Data YYY 12:40:08 Data XXX
Data YYY 12:40:09 Data XXX
Data YYY 12:40:10 Data XXX
File 3
Col 0 Col 1 Col 2 Col 3
Data YYY 12:40:11 Data XXX
Data YYY 12:40:12 Data XXX
Data YYY 12:40:13 Data XXX
and so on until the end (the variable above equals 3 seconds).
My Python code is:
import csv
from datetime import datetime

fieldnames = ['Col 0', 'Hour', 'Minute', 'Second', 'Col 2', 'Col 3']
files = {}
writers = {}
seconds = []

with open('4_Columns_PRi_Output.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        output_row = {}
        output_row['Col 0'] = row['Col 0']
        change_date = datetime.strptime(row['Col 1'].split(',')[0], '%H:%M:%S')
        output_row['Hour'] = change_date.strftime('%H')
        output_row['Minute'] = change_date.strftime('%M')
        sec = change_date.strftime('%S')
        output_row['Second'] = sec
        if sec not in seconds:
            output_file = open('corrected'+str(sec)+".csv", 'w')
            writer = csv.DictWriter(output_file, fieldnames=fieldnames, lineterminator='\n')
            writer.writeheader()
            files[sec] = output_file
            writers[sec] = writer
            seconds.append(sec)
        else:
            output_file = open('corrected'+str(sec)+".csv", 'w+')
            writer = csv.DictWriter(output_file, fieldnames=fieldnames, lineterminator='\n')
        output_row['Col 2'] = row['Col 2']
        output_row['Col 3'] = row['Col 3'].strip()
        writers[sec].writerow(output_row)

for key in files:
    files[key].close()
Any help would be much appreciated.
Answer 0 (score: 0)
In Python, you can compare datetimes just as you compare ints. For example:
>>> import datetime
>>> this_morning = datetime.datetime(2009, 12, 2, 9, 30)
>>> last_night = datetime.datetime(2009, 12, 1, 20, 0)
>>> this_morning.time() < last_night.time()
evaluates to True. Source.
You can also add a timedelta to (or subtract one from) a datetime. Example:
import datetime
a = datetime.datetime(100,1,1,11,34,59)
b = a + datetime.timedelta(seconds=3)
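Printing a.time() and b.time() gives 11:34:59 and 11:35:02. Source.
So, as you write the csv files, keep a list of the datetime objects you are putting in. For the first datetime in the list, add N seconds to it using maxTime = firstTime + datetime.timedelta(seconds=N). As you build the list, check thisTime <= maxTime. If it evaluates to False, start a new file and repeat the process for that file.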
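The steps above could be sketched roughly as follows. This is only an illustration under assumptions: the rows are already sorted by time, the time sits in the second column as HH:MM:SS, and the name split_by_window and the sample rows are made up for the example. It uses a strict comparison against the end of the window (rather than the <= check mentioned above) so that an N-second window holds N one-second readings.

```python
from datetime import datetime, timedelta

def split_by_window(rows, n_seconds):
    """Group time-sorted rows into consecutive windows of n_seconds.

    Hypothetical helper: assumes row[1] holds the time as HH:MM:SS.
    """
    groups = []
    max_time = None
    for row in rows:
        this_time = datetime.strptime(row[1], "%H:%M:%S")
        # Start a new group on the first row, or once the current window has elapsed.
        # Each window is anchored at the first row of its group.
        if max_time is None or this_time >= max_time:
            groups.append([])
            max_time = this_time + timedelta(seconds=n_seconds)
        groups[-1].append(row)
    return groups

# Sample rows shaped like the data in the question (12:40:05 to 12:40:13).
rows = [["Data YYY", "12:40:%02d" % s, "Data XXX"] for s in range(5, 14)]
groups = split_by_window(rows, 3)
for group in groups:
    print([r[1] for r in group])
```

Each inner list can then be written to its own csv file; changing the second argument to 10, 15 or 19 changes the window length.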
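Answer 1 (score: 0)
Please read the comments throughout the code for an explanation.
Essentially I have 3 methods:
t() returns the whole text as a blob, as you provided it.
partTuples(tupleList, secs) partitions the parsed tuples according to secs.
dtFromString(s) helps parse the HH:MM:SS strings (supplied by t()) into datetime objects.
from datetime import datetime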
from datetime import timedelta
from datetime import date

def t():
    # fixed-width columns: the slices used for tpls below depend on this spacing
    #        1         2         3         4
    #234567890123456789012345678901234567890123
    return '''
Col 0    Col 1       Col 2    Col 3
Data YYY 12:40:05    Data XXX
Data YYY 12:40:06    Data XXX
Data YYY 12:40:07    Data XXX
Data YYY 12:40:08    Data XXX
Data YYY 12:40:09    Data XXX
Data YYY 12:40:10    Data XXX
Data YYY 12:40:11    Data XXX
Data YYY 12:40:12    Data XXX
Data YYY 12:40:13    Data XXX
'''

# splits parsed lines into lists that each contain one file's content
def partTuples(tupelList, secs):
    rv = []
    oneSet = []
    doneTime = None
    for t in tupelList:
        myTime = t[1].time()
        if doneTime is None:
            doneTime = (datetime.combine(date.today(), myTime) + timedelta(seconds=secs)).time()
        # inclusive upper bound: a row exactly secs after the group's first row
        # still belongs to that group
        if myTime <= doneTime:
            oneSet.append(t[:])
        elif myTime > doneTime:
            rv.append(oneSet[:])  # copy
            oneSet = []
            oneSet.append(t[:])
            doneTime = (datetime.combine(date.today(), myTime) + timedelta(seconds=secs)).time()
    if len(oneSet) > 0:
        rv.append(oneSet[:])
    return rv

def dtFromString(s):
    splitted = s.split(":")
    hh = int(splitted[0])
    mm = int(splitted[1])
    ss = int(splitted[2])
    return datetime.combine(date.today(), datetime(2000,1,1,hh, mm,ss).time())

# parses your file's data into a list, parsing a datetime object from the text
# if you have csv with , separation instead of the fixed column
# length data printed above - you need to adapt this
# I did not bother to parse Col 3 as it's empty anyway - adapt that as well
tpls = [ (x[0:8].strip(), dtFromString(x[9:20]), x[21:].strip(),"") for x in t().splitlines() if len(x.strip()) > 0 and not "Col" in x]

# print parsed file
print()
print(tpls)

# print split content - empty line == new file
print()
for fileCont in partTuples(tpls,3):
    for parts in fileCont:
        print(parts)
    print()
Output:
[('Data YYY', datetime.datetime(2017, 12, 1, 12, 40, 5), 'Data XXX', ''), ('Data
YYY', datetime.datetime(2017, 12, 1, 12, 40, 6), 'Data XXX', ''), ('Data YYY',
datetime.datetime(2017, 12, 1, 12, 40, 7), 'Data XXX', ''), ('Data YYY', datetim
e.datetime(2017, 12, 1, 12, 40, 8), 'Data XXX', ''), ('Data YYY', datetime.datet
ime(2017, 12, 1, 12, 40, 9), 'Data XXX', ''), ('Data YYY', datetime.datetime(201
7, 12, 1, 12, 40, 10), 'Data XXX', ''), ('Data YYY', datetime.datetime(2017, 12,
1, 12, 40, 11), 'Data XXX', ''), ('Data YYY', datetime.datetime(2017, 12, 1, 12
, 40, 12), 'Data XXX', ''), ('Data YYY', datetime.datetime(2017, 12, 1, 12, 40,
13), 'Data XXX', '')]
('Data YYY', datetime.datetime(2017, 12, 1, 12, 40, 5), 'Data XXX', '')
('Data YYY', datetime.datetime(2017, 12, 1, 12, 40, 6), 'Data XXX', '')
('Data YYY', datetime.datetime(2017, 12, 1, 12, 40, 7), 'Data XXX', '')
('Data YYY', datetime.datetime(2017, 12, 1, 12, 40, 8), 'Data XXX', '')
('Data YYY', datetime.datetime(2017, 12, 1, 12, 40, 9), 'Data XXX', '')
('Data YYY', datetime.datetime(2017, 12, 1, 12, 40, 10), 'Data XXX', '')
('Data YYY', datetime.datetime(2017, 12, 1, 12, 40, 11), 'Data XXX', '')
('Data YYY', datetime.datetime(2017, 12, 1, 12, 40, 12), 'Data XXX', '')
('Data YYY', datetime.datetime(2017, 12, 1, 12, 40, 13), 'Data XXX', '')
Press any key to continue . . .
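If your data is comma separated rather than fixed width, the parsing step could be adapted roughly like this. It is a sketch only: parse_csv_lines is a made-up name, and the sample lines assume three comma-separated fields with the time in the second field. It produces the same 4-tuple shape as tpls above, so the result could be passed to partTuples.

```python
import csv
from datetime import date, datetime

def parse_csv_lines(lines):
    """Turn comma-separated lines into (col0, datetime, col2, "") tuples.

    Hypothetical helper: assumes a header row and the time in field 1.
    """
    reader = csv.reader(lines)
    next(reader)  # skip the header row
    parsed = []
    for row in reader:
        if not row:
            continue  # ignore blank lines
        # Combine the parsed time with today's date, as dtFromString does.
        when = datetime.combine(
            date.today(), datetime.strptime(row[1].strip(), "%H:%M:%S").time()
        )
        # Col 3 is left empty, as in the fixed-width version.
        parsed.append((row[0].strip(), when, row[2].strip(), ""))
    return parsed

# Assumed sample input; an open file object can be passed instead of a list.
sample = [
    "Col 0,Col 1,Col 2,Col 3",
    "Data YYY,12:40:05,Data XXX",
    "Data YYY,12:40:06,Data XXX",
]
tuples = parse_csv_lines(sample)
print(tuples)
```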
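Answer 2 (score: 0)
First convert your times to datetime objects. A timedelta object can then be used to step forward by the required number of seconds.
This script keeps reading rows until the next boundary is reached. It then writes the accumulated rows to an output CSV file, using the start time as the filename: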
from datetime import datetime, timedelta
import csv

def output_csv(output):
    filename = "{}.csv".format(get_dt(output[0]).strftime("%H_%M_%S"))

    with open(filename, 'w', newline='') as f_output:
        csv_writer = csv.writer(f_output)
        csv_writer.writerow(header)
        csv_writer.writerows(output)

get_dt = lambda x: datetime.strptime(x[1], '%H:%M:%S')
seconds = timedelta(seconds=3)  # set number of seconds to advance

with open('input.csv', 'r', newline='') as f_input:
    csv_reader = csv.reader(f_input)
    header = next(csv_reader)
    output = [next(csv_reader)]
    read_until = get_dt(output[0]) + seconds

    for row in csv_reader:
        if get_dt(row) >= read_until:
            read_until += seconds
            output_csv(output)
            output = []

        output.append(row)

    output_csv(output)
For example, your first CSV would be 12_40_05.csv:
Col 0,Col 1,Col 2,Col 3
Data YYY,12:40:05,Data XXX
Data YYY,12:40:06,Data XXX
Data YYY,12:40:07,Data XXX
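To make the duration easy to vary (10, 15 or 19 seconds, say), the same idea could be wrapped in a function. The sketch below is an alternative formulation, not the script above: it assigns each row to a bucket by integer-dividing the seconds elapsed since the first row, then groups consecutive rows with itertools.groupby. The function name split_csv, its parameters and the output naming are assumptions, and it expects rows sorted by time within a single day.

```python
import csv
import os
from datetime import datetime
from itertools import groupby

def split_csv(input_path, output_dir, window_seconds):
    """Write one CSV per window_seconds of data; return the paths written.

    Hypothetical helper: assumes a header row and HH:MM:SS in column 1.
    """
    with open(input_path, newline="") as f_input:
        reader = csv.reader(f_input)
        header = next(reader)
        rows = [row for row in reader if row]  # drop blank lines

    if not rows:
        return []

    def to_dt(row):
        return datetime.strptime(row[1], "%H:%M:%S")

    start = to_dt(rows[0])

    def bucket(row):
        # Number of whole windows elapsed since the first row.
        return int((to_dt(row) - start).total_seconds()) // window_seconds

    written = []
    # groupby only merges consecutive rows, hence the sorted-input assumption.
    for _, group in groupby(rows, key=bucket):
        group = list(group)
        # Name each file after the time of its first row, e.g. 12_40_05.csv.
        name = to_dt(group[0]).strftime("%H_%M_%S") + ".csv"
        path = os.path.join(output_dir, name)
        with open(path, "w", newline="") as f_output:
            writer = csv.writer(f_output)
            writer.writerow(header)
            writer.writerows(group)
        written.append(path)
    return written
```

A call such as split_csv('input.csv', '.', 10) would then write 10-second files into the current directory. Because the buckets are computed from elapsed time, a gap in the data longer than one window simply produces no file for the empty windows.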