Question

I want to transform a CSV file in a specific way. Here is my sample CSV file:

name,time,Operations
Cassandra,2015-10-06T15:07:22.333662984Z,INSERT
Cassandra,2015-10-06T15:07:24.334536781Z,INSERT
Cassandra,2015-10-06T15:07:27.339662984Z,READ
Cassandra,2015-10-06T15:07:28.344493608Z,READ
Cassandra,2015-10-06T15:07:28.345221189Z,READ
Cassandra,2015-10-06T15:07:29.345623750Z,READ
Cassandra,2015-10-06T15:07:31.352725607Z,UPDATE
Cassandra,2015-10-06T15:07:33.360272493Z,UPDATE
Cassandra,2015-10-06T15:07:38.366408708Z,UPDATE

I know how to read from a CSV file using Python's parser, but I'm a complete beginner. I need to get output like this:

start_time,end_time,operation
2015-10-06T15:07:22.333662984Z,2015-10-06T15:07:24.334536781Z,INSERT
2015-10-06T15:07:27.339662984Z,2015-10-06T15:07:29.345623750Z,READ
2015-10-06T15:07:31.352725607Z,2015-10-06T15:07:38.366408708Z,UPDATE

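Note: the start time is the timestamp at which a particular query (INSERT, READ, UPDATE) begins, and the end time is when that query completes.

Thanks.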

Answer 1

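From your example, it looks like you can (probably) rely on the first entry of a given type in the "Operations" column and the last entry of that type being the start and stop times. If you can't guarantee that, things get slightly more complicated, but let's assume you can't, since that makes the solution more robust.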

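One thing we can assume is that the data in the CSV is complete. If entries for a particular operation are missing, there is little we can do. We also need to read the timestamps, which we can do with the dateutil.parser module.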
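
To make the timestamp step concrete, here is a minimal sketch that uses only the standard library instead of dateutil, in case installing a third-party package is not an option. The helper name parse_ns_timestamp is hypothetical, and the sketch assumes every timestamp looks like the ones in the sample: UTC with a trailing Z and a fractional-seconds part. Python's datetime stores at most microseconds, so digits beyond the sixth are dropped.

```python
from datetime import datetime, timezone

def parse_ns_timestamp(text):
    """Parse e.g. '2015-10-06T15:07:22.333662984Z' into an aware UTC datetime."""
    # Assumption: the value ends in 'Z' and has a fractional-seconds part.
    whole, fraction = text.rstrip('Z').split('.')
    # datetime keeps at most 6 fractional digits, so pad/truncate to microseconds.
    micros = fraction.ljust(6, '0')[:6]
    parsed = datetime.strptime(whole + '.' + micros, '%Y-%m-%dT%H:%M:%S.%f')
    return parsed.replace(tzinfo=timezone.utc)

first = parse_ns_timestamp('2015-10-06T15:07:22.333662984Z')
last = parse_ns_timestamp('2015-10-06T15:07:24.334536781Z')
print(first < last)   # the parsed values compare chronologically
```

Either way, the parsed values are datetime objects, so they can be compared with < and > when looking for the earliest and latest entry of an operation.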

So we can start by setting up a short dictionary to keep track of our values, and a function that populates that dictionary one row at a time.

import dateutil.parser

ops = dict()

def update_ops(opsdict, row):

    # first get the timestamp and op name in a useable format
    timestamp = dateutil.parser.parse(row[1])
    op_name = row[2]

    ## now populate, or update the dictionary
    if op_name not in opsdict:
        # sets a new dict entry with the operation's timestamp.
        # since we don't know what the start time and end time 
        # is yet, for the moment set them both.
        opsdict[op_name] = { 'start_time': timestamp,
                            'end_time': timestamp }
    else:
        # now evaluate the current timestamp against each start_time
        # and end_time value. Update as needed.
        if opsdict[op_name]['start_time'] > timestamp:
            opsdict[op_name]['start_time'] = timestamp
        if opsdict[op_name]['end_time'] < timestamp:
            opsdict[op_name]['end_time'] = timestamp
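
If the output has to keep the original timestamp text (parsing into datetime objects loses everything past microseconds), one possible variant is to compare the raw strings instead. This is only a sketch under an assumption: all timestamps share one fixed-width UTC format, as in the sample, so plain string comparison orders them chronologically. The name update_ops_raw is hypothetical, and the rows are fed in out of order on purpose to show that ordering in the file does not matter.

```python
def update_ops_raw(opsdict, row):
    """Variant of update_ops that keeps timestamps as their original strings."""
    # Assumption: all timestamps share one fixed-width UTC format, so plain
    # string comparison orders them chronologically.
    timestamp, op_name = row[1], row[2]
    if op_name not in opsdict:
        # First time we see this operation: it is both start and end for now.
        opsdict[op_name] = {'start_time': timestamp, 'end_time': timestamp}
    else:
        entry = opsdict[op_name]
        entry['start_time'] = min(entry['start_time'], timestamp)
        entry['end_time'] = max(entry['end_time'], timestamp)

# Rows taken from the sample file, deliberately out of order.
sample_rows = [
    ['Cassandra', '2015-10-06T15:07:28.344493608Z', 'READ'],
    ['Cassandra', '2015-10-06T15:07:27.339662984Z', 'READ'],
    ['Cassandra', '2015-10-06T15:07:29.345623750Z', 'READ'],
]
demo_ops = {}
for sample_row in sample_rows:
    update_ops_raw(demo_ops, sample_row)
print(demo_ops['READ'])
```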

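Now that we have a function to sort the rows into the dictionary, run the CSV reader and populate ops. Once that's done, we can generate a new CSV file from the contents of the dictionary.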

import csv

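# Python 3 style: next(cr) replaces the Python 2-only cr.next(),
# and the csv module expects files to be opened with newline=''.
with open('/path/to/your/file.csv', newline='') as oldcsv:
    cr = csv.reader(oldcsv)
    next(cr)    # throw away the header row

    for row in cr:
        update_ops(ops, row)

# Now write a new csv file; csv.writer is your friend :)
with open('new_operation_times.csv', 'w', newline='') as newcsv: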
    cw = csv.writer(newcsv)

    # first write your header. csv.writer accepts lists for each row.
    header = 'start_time,end_time,operation'.split(',')
    cw.writerow(header)

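    # now write out your dict values, in the same column order as the
    # header. You may want them sorted, but how to do that has been
    # answered elsewhere on SE.
    # Note: the times are datetime objects, so csv.writer writes their
    # str() form rather than the original timestamp text.
    for opname, timesdict in ops.items():
        row = [ timesdict['start_time'], timesdict['end_time'], opname ]
        cw.writerow(row)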

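You're done! I've tried to spell this out in as much detail as possible so that it's clear what's going on. You could collapse a lot of this into fewer, cleverer steps (for example, reading from one CSV and writing straight out to the other). But if you follow the KISS principle, you'll find it easier to read later on, and to learn from again.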
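
For comparison, here is roughly what a collapsed version might look like on Python 3.7+. It is a sketch, not a drop-in replacement: the function name summarize and the SAMPLE constant are made up for illustration, in-memory io.StringIO objects stand in for the real input and output files, and timestamps are compared as plain strings, which only works if they all share the fixed-width UTC format shown in the question.

```python
import csv
import io

# Sample data copied from the question.
SAMPLE = """name,time,Operations
Cassandra,2015-10-06T15:07:22.333662984Z,INSERT
Cassandra,2015-10-06T15:07:24.334536781Z,INSERT
Cassandra,2015-10-06T15:07:27.339662984Z,READ
Cassandra,2015-10-06T15:07:28.344493608Z,READ
Cassandra,2015-10-06T15:07:28.345221189Z,READ
Cassandra,2015-10-06T15:07:29.345623750Z,READ
Cassandra,2015-10-06T15:07:31.352725607Z,UPDATE
Cassandra,2015-10-06T15:07:33.360272493Z,UPDATE
Cassandra,2015-10-06T15:07:38.366408708Z,UPDATE
"""

def summarize(infile, outfile):
    """Read operation rows from infile, write one start/end row per operation."""
    spans = {}  # operation name -> (start_time, end_time)
    for row in csv.DictReader(infile):
        stamp, op = row['time'], row['Operations']
        # Assumption: fixed-width timestamps, so min/max on strings is chronological.
        start, end = spans.get(op, (stamp, stamp))
        spans[op] = (min(start, stamp), max(end, stamp))

    writer = csv.writer(outfile, lineterminator='\n')
    writer.writerow(['start_time', 'end_time', 'operation'])
    # Dicts keep insertion order, so operations appear in first-seen order.
    for op, (start, end) in spans.items():
        writer.writerow([start, end, op])

out = io.StringIO()
summarize(io.StringIO(SAMPLE), out)
print(out.getvalue())
```

With real files, the two StringIO objects would be replaced by files opened with newline='', as the csv module documentation recommends.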
