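How to split a csv file into multiple csv files depending on a duration using Python?

Asked: 2017-12-01 18:05:46

Tags: python python-3.x csv

I have a csv file with only three columns but more than 200K rows. I want to split it into multiple csv files based on the second column (the time column), so that each file has the same number of columns but fewer rows (depending on my specification). I want the duration to be variable, so that I can put 10 seconds of readings into each file, or 15 seconds, or 19 seconds. I have tried several pieces of code to split the csv file, but without success, as I am very new to Python.

The input csv file looks like this: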

Col 0       Col 1       Col 2       Col 3
Data YYY    12:40:05    Data XXX
Data YYY    12:40:06    Data XXX
Data YYY    12:40:07    Data XXX
Data YYY    12:40:08    Data XXX
Data YYY    12:40:09    Data XXX
Data YYY    12:40:10    Data XXX
Data YYY    12:40:11    Data XXX
Data YYY    12:40:12    Data XXX
Data YYY    12:40:13    Data XXX

The output csv files I would like are:

File 1

Col 0       Col 1       Col 2       Col 3
Data YYY    12:40:05    Data XXX
Data YYY    12:40:06    Data XXX
Data YYY    12:40:07    Data XXX

File 2

Col 0       Col 1       Col 2       Col 3
Data YYY    12:40:08    Data XXX
Data YYY    12:40:09    Data XXX
Data YYY    12:40:10    Data XXX

File 3

Col 0       Col 1       Col 2       Col 3
Data YYY    12:40:11    Data XXX
Data YYY    12:40:12    Data XXX
Data YYY    12:40:13    Data XXX

And so on until the end (the variable above is equal to 3 seconds). My Python code is:

    import csv
    from datetime import datetime

    fieldnames = ['Col 0', 'Hour', 'Minute' , 'Second', 'Col 2' , 'Col 3']

    files = {}
    writers = {}
    seconds = []

    with open('4_Columns_PRi_Output.csv') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            output_row = {}
            output_row['Col 0'] = row['Col 0']
            change_date = datetime.strptime(row['Col 1'].split(',')[0], '%H:%M:%S')
            output_row['Hour'] = change_date.strftime('%H')
            output_row['Minute'] = change_date.strftime('%M')
            sec = change_date.strftime('%S')
            output_row['Second'] = sec

            if sec not in seconds:
                output_file = open('corrected'+str(sec)+".csv", 'w')
                writer = csv.DictWriter(output_file, fieldnames=fieldnames,lineterminator='\n')
                writer.writeheader()
                files[sec] = output_file
                writers[sec] = writer
                seconds.append(sec)
            else:
                output_file = open('corrected'+str(sec)+".csv", 'w+')
                writer = csv.DictWriter(output_file, fieldnames=fieldnames,lineterminator='\n')
            output_row['Col 2'] = row['Col 2']
            output_row['Col 3'] = row['Col 3'].strip()
            writers[sec].writerow(output_row)

    for key in files:
        files[key].close()

Any help is greatly appreciated.

3 Answers:

Answer 0: (score: 0)

In Python, you can compare datetimes just as you would ints. Something like

>>> this_morning = datetime.datetime(2009, 12, 2, 9, 30)
>>> last_night = datetime.datetime(2009, 12, 1, 20, 0)
>>> this_morning.time() < last_night.time()

evaluates to True. Source.

You can also add a timedelta to (or subtract one from) a datetime. Example:

import datetime
a = datetime.datetime(100,1,1,11,34,59)
b = a + datetime.timedelta(seconds=3)

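Printing a.time() and b.time() then gives 11:34:59 and 11:35:02. Source

So, when writing your csv files, keep a list of the datetime objects you are putting in. For the first datetime in the list, add N seconds to it using maxTime = firstTime + datetime.timedelta(seconds=N). As you build the list, check thisTime <= maxTime. If it evaluates to False, start a new file and repeat the process for that file.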
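The approach described above can be sketched roughly as follows. This is a minimal illustration rather than the answerer's own code: the function name `group_by_window` and the `(label, 'HH:MM:SS', value)` row layout are assumptions, and it uses a strict comparison instead of `<=` so that a 3-second window starting at 12:40:05 holds exactly 12:40:05 through 12:40:07.

```python
import datetime

def group_by_window(rows, n_seconds):
    """Group rows into consecutive windows of n_seconds.

    Each row is assumed to be a sequence whose second item is an
    'HH:MM:SS' string (hypothetical layout, mirroring the question).
    """
    window = datetime.timedelta(seconds=n_seconds)
    groups = []
    max_time = None
    for row in rows:
        this_time = datetime.datetime.strptime(row[1], "%H:%M:%S")
        # Start a new group for the first row, or once the window has elapsed.
        if max_time is None or this_time >= max_time:
            groups.append([])
            max_time = this_time + window
        groups[-1].append(row)
    return groups

# Sample rows modelled on the question: 12:40:05 through 12:40:13.
rows = [("Data YYY", "12:40:%02d" % s, "Data XXX") for s in range(5, 14)]
groups = group_by_window(rows, 3)
for group in groups:
    print([r[1] for r in group])
```

Each inner list can then be written to its own file, for example with csv.writer.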

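Answer 1: (score: 0)

Please read the comments throughout the code for explanations.

Essentially, I have 3 methods plus a list comprehension:

  • t() returns the whole text you provided as one blob.
  • partTuples(tupleList, secs) partitions the preprocessed csv list according to secs.
  • dtFromString(s) helps parse HH:MM:SS into a datetime object.
  • A list comprehension preprocesses the csv data (as provided by t()).
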
from datetime import datetime
from datetime import timedelta 
from datetime import date 

def t() :
# spacing
#        1         2         3         4 
#234567890123456789012345678901234567890123
    return '''
Col 0       Col 1       Col 2       Col 3
Data YYY    12:40:05    Data XXX
Data YYY    12:40:06    Data XXX
Data YYY    12:40:07    Data XXX
Data YYY    12:40:08    Data XXX
Data YYY    12:40:09    Data XXX
Data YYY    12:40:10    Data XXX
Data YYY    12:40:11    Data XXX
Data YYY    12:40:12    Data XXX
Data YYY    12:40:13    Data XXX 
'''

# split the parsed lines into lists, each holding one file's content
def partTuples(tupleList, secs):
    rv = []
    oneSet = []
    doneTime = None

    for t in tupleList:
        myTime = t[1].time()

        if doneTime is None:
            doneTime = (datetime.combine(date.today(), myTime) + timedelta(seconds=secs)).time()

        if myTime <= doneTime:
            oneSet.append(t[:])

        elif (myTime > doneTime):
            rv.append(oneSet[:]) # copy
            oneSet = []
            oneSet.append(t[:]) 
            doneTime = (datetime.combine(date.today(), myTime) + timedelta(seconds=secs)).time()

    if len(oneSet) > 0:
        rv.append(oneSet[:])

    return rv


def dtFromString(s):
    splitted = s.split(":")
    hh = int(splitted[0])
    mm = int(splitted[1])
    ss = int(splitted[2])
    return datetime.combine(date.today(), datetime(2000,1,1,hh, mm,ss).time())

# parse your file's data into a list, building a datetime object from the text
# if your csv uses , separation instead of the fixed-width columns printed
# above, you need to adapt this
# I did not bother to parse Col 3 as it is empty anyway - adapt that as well
tpls = [ (x[0:8].strip(), dtFromString(x[9:20]), x[21:].strip(),"") for x in t().splitlines() if len(x.strip()) > 0 and "Col" not in x]


# print parsed file
print()
print(tpls)

# print split content - empty line == new file
print() 
for fileCont in partTuples(tpls,3):
    for parts in fileCont:
        print(parts)     
    print()

Output:

[('Data YYY', datetime.datetime(2017, 12, 1, 12, 40, 5), 'Data XXX', ''), ('Data
 YYY', datetime.datetime(2017, 12, 1, 12, 40, 6), 'Data XXX', ''), ('Data YYY',
datetime.datetime(2017, 12, 1, 12, 40, 7), 'Data XXX', ''), ('Data YYY', datetim
e.datetime(2017, 12, 1, 12, 40, 8), 'Data XXX', ''), ('Data YYY', datetime.datet
ime(2017, 12, 1, 12, 40, 9), 'Data XXX', ''), ('Data YYY', datetime.datetime(201
7, 12, 1, 12, 40, 10), 'Data XXX', ''), ('Data YYY', datetime.datetime(2017, 12,
 1, 12, 40, 11), 'Data XXX', ''), ('Data YYY', datetime.datetime(2017, 12, 1, 12
, 40, 12), 'Data XXX', ''), ('Data YYY', datetime.datetime(2017, 12, 1, 12, 40,
13), 'Data XXX', '')]

('Data YYY', datetime.datetime(2017, 12, 1, 12, 40, 5), 'Data XXX', '')
('Data YYY', datetime.datetime(2017, 12, 1, 12, 40, 6), 'Data XXX', '')
('Data YYY', datetime.datetime(2017, 12, 1, 12, 40, 7), 'Data XXX', '')
('Data YYY', datetime.datetime(2017, 12, 1, 12, 40, 8), 'Data XXX', '')

('Data YYY', datetime.datetime(2017, 12, 1, 12, 40, 9), 'Data XXX', '')
('Data YYY', datetime.datetime(2017, 12, 1, 12, 40, 10), 'Data XXX', '')
('Data YYY', datetime.datetime(2017, 12, 1, 12, 40, 11), 'Data XXX', '')
('Data YYY', datetime.datetime(2017, 12, 1, 12, 40, 12), 'Data XXX', '')

('Data YYY', datetime.datetime(2017, 12, 1, 12, 40, 13), 'Data XXX', '')

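The code above only prints the partitions. A minimal sketch of writing each partition to its own CSV file might look like the following. The helper name `write_groups`, the `part_N.csv` naming scheme, the temporary output directory, and the sample rows are all assumptions made for illustration and are not part of the answer.

```python
import csv
import os
import tempfile

def write_groups(groups, header, out_dir, prefix="part_"):
    """Write each group of rows to its own CSV file and return the paths."""
    paths = []
    for index, group in enumerate(groups, start=1):
        path = os.path.join(out_dir, "{}{}.csv".format(prefix, index))
        # newline='' lets the csv module control line endings itself.
        with open(path, "w", newline="") as f_output:
            writer = csv.writer(f_output)
            writer.writerow(header)
            writer.writerows(group)
        paths.append(path)
    return paths

# Hypothetical sample: two partitions of already-formatted rows.
header = ["Col 0", "Col 1", "Col 2"]
groups = [
    [("Data YYY", "12:40:05", "Data XXX"), ("Data YYY", "12:40:06", "Data XXX")],
    [("Data YYY", "12:40:07", "Data XXX")],
]
out_dir = tempfile.mkdtemp()
paths = write_groups(groups, header, out_dir)
print(paths)
```

With the tuples produced by partTuples, the datetime in position 1 would first need to be formatted back to text, for example with strftime('%H:%M:%S'), before the rows are passed in.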

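Answer 2: (score: 0)

First convert your times into datetime objects. A timedelta object can then be used to step forward by the required number of seconds.

This script keeps reading rows until it reaches the next boundary. It then writes the accumulated rows to an output CSV file, using the start time as the filename: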

from datetime import datetime, timedelta
import csv

def output_csv(output):
    filename = "{}.csv".format(get_dt(output[0]).strftime("%H_%M_%S"))

    with open(filename, 'w', newline='') as f_output:
        csv_writer = csv.writer(f_output)
        csv_writer.writerow(header)
        csv_writer.writerows(output)

get_dt = lambda x: datetime.strptime(x[1], '%H:%M:%S')
seconds = timedelta(seconds=3)      # set number of seconds to advance 

with open('input.csv', 'r', newline='') as f_input:
    csv_reader = csv.reader(f_input)
    header = next(csv_reader)
    output = [next(csv_reader)]
    read_until = get_dt(output[0]) + seconds

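    for row in csv_reader:
        if get_dt(row) >= read_until:
            output_csv(output)
            output = []
            # Advance past any empty windows so that a gap in the data
            # does not leave the boundary lagging behind the rows.
            while get_dt(row) >= read_until:
                read_until += seconds
        output.append(row)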

output_csv(output)

For example, your first CSV would be 12_40_05.csv:

Col 0,Col 1,Col 2,Col 3
Data YYY,12:40:05,Data XXX
Data YYY,12:40:06,Data XXX
Data YYY,12:40:07,Data XXX
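
As an alternative to tracking a moving boundary, the window a row belongs to can be computed directly from its offset to the first timestamp. The sketch below uses itertools.groupby for this, which is a different technique from the script above. The name `split_by_offset` is hypothetical, and the rows are assumed to be sorted by time and to fall within a single day.

```python
from datetime import datetime
from itertools import groupby

def parse_time(row):
    """Parse the 'HH:MM:SS' string assumed to sit in column 1."""
    return datetime.strptime(row[1], "%H:%M:%S")

def split_by_offset(rows, n_seconds):
    """Yield (start_time_string, rows) for each non-empty window.

    Windows are n_seconds wide and aligned to the first row's timestamp.
    """
    rows = list(rows)
    if not rows:
        return
    first = parse_time(rows[0])

    def window_index(row):
        # Whole seconds since the first row, floored to a window number.
        elapsed = int((parse_time(row) - first).total_seconds())
        return elapsed // n_seconds

    for _, members in groupby(rows, key=window_index):
        members = list(members)
        yield members[0][1], members

# Hypothetical sample with a gap between 12:40:09 and 12:40:20.
rows = [["Data YYY", "12:40:%02d" % s, "Data XXX"] for s in (5, 6, 7, 8, 9, 20, 21)]
result = list(split_by_offset(rows, 3))
for start, members in result:
    print(start, len(members))
```

In this sample the rows at 12:40:20 and 12:40:21 land in the same window even though several empty windows precede them.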