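Compare specific columns of two rows in a file

Time: 2017-06-25 12:54:50

Tags: python file csv

I want to check whether two rows in a file have the same values in columns 1 and 2. If they do not, the row is written to a separate 'output' file. If they do, only the most recent row, based on the third column (the timestamp), is written to the 'output' file.

The snippet below compares whole lines rather than columns. How can I compare by column instead?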

#!/usr/bin/python
import os,sys,csv

file_open = sys.argv[1]
with open(file_open, 'r') as f1, open('output.txt', 'w+') as f2:
    lines2 = f2.readlines()
    for line in f1:
        if line not in lines2:  # compares the whole line, not individual columns
            f2.write(line)

Input

1,A,28/04/17 10:57:28.096

3,A,28/04/17 10:57:46.950

1,A,28/04/17 10:59:16.969

3,A,28/04/17 11:02:09.341

4,A,28/04/17 11:03:09.432

Expected output

1,A,28/04/17 10:59:16.969

3,A,28/04/17 11:02:09.341

4,A,28/04/17 11:03:09.432

3 Answers:

Answer 0 (score: 0)

Since you are already importing the csv module, I suggest you use it.

import sys 
import csv

seen = set()

file_open = sys.argv[1]
with open(file_open, 'r') as f1, open('output.txt','w') as f2:
    reader = csv.reader(f1)
    writer = csv.writer(f2)

    for line in reader:
        if not len(line): # a quick check to make sure it's a valid line
            continue

        if (line[0], line[1]) not in seen:
            seen.add((line[0], line[1]))
            writer.writerow(line)

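This code checks that no row with the same first and second columns has already been seen before writing a row. Tuples are hashable, so they can be stored in a set, which makes this easy to do. Note that it keeps the first row seen for each pair of columns rather than the most recent one, as the output shows.

Output: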

1,A,28/04/17 10:57:28.096
3,A,28/04/17 10:57:46.950
4,A,28/04/17 11:03:09.432
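
To make the point about hashable tuples concrete, here is a minimal sketch of the same idea run on in-memory data instead of files. The function name `first_per_key` and the sample rows are only illustrative; they are not part of the answer above.

```python
import csv
import io

def first_per_key(lines):
    """Return the first row seen for each (column 1, column 2) pair."""
    seen = set()
    kept = []
    for row in csv.reader(lines):
        if len(row) < 2:  # skip blank or too-short rows
            continue
        key = (row[0], row[1])  # a tuple is hashable, so it can be a set member
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

# Illustrative sample taken from the question's input
sample = io.StringIO(
    "1,A,28/04/17 10:57:28.096\n"
    "3,A,28/04/17 10:57:46.950\n"
    "1,A,28/04/17 10:59:16.969\n"
)
rows = first_per_key(sample)
```

A list such as `[row[0], row[1]]` could not be used as the key here, because lists are mutable and therefore not hashable.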

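Answer 1 (score: 0)

A modified version of @Coldspeed's code that uses an OrderedDict to keep the latest entry by timestamp (assuming the timestamps appear in ascending order).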

import sys 
import csv

from collections import OrderedDict

history = OrderedDict()

file_open = sys.argv[1]
with open(file_open, 'r') as f1, open('output.txt','w') as f2:
    reader = csv.reader(f1)
    writer = csv.writer(f2)

    for line in reader:
        if not len(line): # valid line check
            continue
        history[(line[0], line[1])] = line[2] # Adds if new, updates if present

    for line in list(history.items()):
        writer.writerow([line[0][0], line[0][1], line[1]])

Contents of output.txt:

1,A,28/04/17 10:59:16.969
3,A,28/04/17 11:02:09.341
4,A,28/04/17 11:03:09.432
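
This answer relies on the timestamps already being in ascending order. If that cannot be guaranteed, one possible variation is to parse each timestamp and replace the stored row only when the new one is later. The sketch below assumes the `%d/%m/%y %H:%M:%S.%f` format seen in the question's data and Python 3.7+ (where a plain dict preserves insertion order); `latest_per_key` and `TS_FORMAT` are hypothetical names chosen for illustration.

```python
import csv
import io
from datetime import datetime

# Assumed timestamp layout, based on the sample data in the question
TS_FORMAT = "%d/%m/%y %H:%M:%S.%f"

def latest_per_key(lines):
    """Return, for each (column 1, column 2) pair, the row with the latest timestamp."""
    latest = {}  # key -> (parsed timestamp, original row)
    for row in csv.reader(lines):
        if len(row) < 3:  # skip blank or incomplete rows
            continue
        key = (row[0], row[1])
        stamp = datetime.strptime(row[2], TS_FORMAT)
        if key not in latest or stamp > latest[key][0]:
            latest[key] = (stamp, row)
    # Rows come back in the order each key first appeared
    return [row for _, row in latest.values()]

# Deliberately out-of-order input to show the comparison at work
sample = io.StringIO(
    "1,A,28/04/17 10:59:16.969\n"
    "3,A,28/04/17 10:57:46.950\n"
    "1,A,28/04/17 10:57:28.096\n"
    "3,A,28/04/17 11:02:09.341\n"
    "4,A,28/04/17 11:03:09.432\n"
)
rows = latest_per_key(sample)
```

Comparing parsed `datetime` objects rather than raw strings matters here: with a day-first format, string comparison would not order dates correctly across different days or months.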

Answer 2 (score: 0)

A short solution using the itertools.groupby() function and the datetime module (to compare the date strings):

import sys, csv, itertools, datetime, operator

with open(sys.argv[1], 'r') as in_csv, open('output.csv', 'w') as out_csv:
    reader = csv.reader(in_csv)
    lines = [ max(g, key=lambda x: datetime.datetime.strptime(x[2], '%d/%m/%y %H:%M:%S.%f'))
              for k,g in itertools.groupby(sorted(reader, key=lambda r: (r[0], r[1])), key=operator.itemgetter(0,1))]

    writer = csv.writer(out_csv, lineterminator='\n')
    for l in lines:
        writer.writerow(l)

Contents of output.csv:

1,A,28/04/17 10:59:16.969
3,A,28/04/17 11:02:09.341
4,A,28/04/17 11:03:09.432
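
The `sorted(...)` call in this solution is doing real work: `itertools.groupby()` only groups consecutive items with equal keys, so rows sharing a key must be adjacent first. A small sketch with made-up rows (the values `x`, `y`, `z` are placeholders, not real data) shows the difference:

```python
from itertools import groupby
from operator import itemgetter

# Placeholder rows; the two ("1", "A") rows are not adjacent
rows = [["1", "A", "x"], ["3", "A", "y"], ["1", "A", "z"]]
key = itemgetter(0, 1)  # group on the first two columns

# Without sorting, the two ("1", "A") rows end up in separate groups
unsorted_groups = [(k, len(list(g))) for k, g in groupby(rows, key=key)]

# After sorting by the same key, they are merged into one group
sorted_groups = [(k, len(list(g))) for k, g in groupby(sorted(rows, key=key), key=key)]
```

One side effect worth keeping in mind is that the columns are sorted as strings, so the output order follows string ordering (for example, "10" sorts before "3") rather than the order of the input file.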