Completely remove duplicate rows from a csv file, based on the first id before the delimiter?

Time: 2017-01-20 06:29:44

Tags: python csv

I have a csv file like this:

b5711586dc018c1deed6b1ea596da304|f4e3945da368711abb3110b621ceada5c21c11f8|bdf7f718f579d64060c7739225de573e4ffda7fe8b10cdaaeb672de5b7c0698e|2017-01-20 11:42:12|111|Relative|path
1beb1d0ac2d24cb87d8fe6ce05601136|f5ace00777f68909d106719629c85fb3af23b810|62f6ebb14ede7a1b6307cea5f58a18ff59282650af750a575d1bdb530c04f11f|2017-01-20 11:42:12|111|Relative|path
b5711586dc018c1deed6b1ea596da304|f4e3945da368711abb3110b621ceada5c21c11f8|bdf7f718f579d64060c7739225de573e4ffda7fe8b10cdaaeb672de5b7c0698e|2017-01-20 11:43:28|111|Relative|path
1beb1d0ac2d24cb87d8fe6ce05601136|f5ace00777f68909d106719629c85fb3af23b810|62f6ebb14ede7a1b6307cea5f58a18ff59282650af750a575d1bdb530c04f11f|2017-01-20 11:43:28|111|Relative|path
b5711586dc018c1deed6b1ea596da304|f4e3945da368711abb3110b621ceada5c21c11f8|bdf7f718f579d64060c7739225de573e4ffda7fe8b10cdaaeb672de5b7c0698e|2017-01-20 11:48:03|111|Relative|path
1beb1d0ac2d24cb87d8fe6ce05601136|f5ace00777f68909d106719629c85fb3af23b810|62f6ebb14ede7a1b6307cea5f58a18ff59282650af750a575d1bdb530c04f11f|2017-01-20 11:48:03|111|Relative|path

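But I want to remove the redundant rows and keep only the unique ones.

Is there a way to write a Python script that does this? I used the following script: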

import csv
# Python 3: open in text mode with newline='' for the csv module
with open('results/20_01_2017_db_file.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        print(', '.join(row))

6 answers:

Answer 0 (score: 2)

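with open('results/20_01_2017_db_file.csv', 'r') as in_file, open('results/20_01_2017_db_unique_file.csv', 'w') as out_file:
    dupl = set()  # lines that have already been written
    for line in in_file:
        if line in dupl:
            continue  # skip a line that is an exact repeat of an earlier one
        dupl.add(line)
        out_file.write(line)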

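Answer 1 (score: 2)

Instead of parsing the rows as delimited fields, you can read them as plain lines and put them in a set, which hashes each line and keeps only one copy of it.

This should work for you: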

with open('results/20_01_2017_db_file.csv', 'rb') as f:
    line_set = set(f)  # each distinct line is stored once; order is not preserved

with open('results/20_01_2017_db_file_v2.csv', 'wb') as f:
    for line in line_set:
        f.write(line)
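
One thing to keep in mind with a set is that the lines come back in arbitrary order. A minimal sketch of an order-preserving variant, assuming Python 3.7+ (where a dict keeps insertion order); the helper name `unique_lines` and the shortened sample lines are made up for illustration:

```python
def unique_lines(lines):
    """Return the distinct lines in first-seen order (hypothetical helper)."""
    # dict keys are unique and, from Python 3.7 on, keep insertion order
    return list(dict.fromkeys(lines))


# Invented, shortened stand-ins for the lines in the question
sample = [
    "a|1|2017-01-20 11:42:12\n",
    "b|2|2017-01-20 11:42:12\n",
    "a|1|2017-01-20 11:42:12\n",  # exact repeat of the first line
    "a|1|2017-01-20 11:43:28\n",  # same id, different timestamp
]
result = unique_lines(sample)
```

Like the set version, this compares whole lines, so two lines that share an id but differ in any other field (the timestamp, for example) are both kept.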

Answer 2 (score: 1)

Do it like this:

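import csv

new_rows = set()
with open('results/20_01_2017_db_file.csv', newline='') as f:
    reader = csv.reader(f, delimiter='|')
    for row in reader:
        new_rows.add(tuple(row))  # lists are unhashable, so store each row as a tuple

with open('results/20_01_2017_db_fileUniq.csv', 'w', newline='') as fout:
    writer = csv.writer(fout, delimiter='|')
    writer.writerows(new_rows)  # note: a set does not preserve the input order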

Answer 3 (score: 1)

Use a set to remember the id of every row already seen, and print only the rows whose id has not appeared yet:

import csv
with open('a.csv', newline='') as f:
    reader = csv.reader(f, delimiter='|')  # need to specify delimiter
    rows_seen = set()  # first-column ids seen so far
    for row in reader:
        row_key = row[0]
        if row_key not in rows_seen:
            print(', '.join(row))
        rows_seen.add(row_key)

Also note that you need to specify the delimiter (|) explicitly, because it is not the default one.
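
The loop above prints the surviving rows. If they should go to a new file instead, the same idea could be sketched as follows. This is only an illustration under a few assumptions: Python 3, the id is always the first field, and the first row seen for an id is the one to keep. The function name `write_unique_by_id` and the in-memory sample are hypothetical.

```python
import csv
import io


def write_unique_by_id(in_file, out_file, delimiter='|'):
    """Copy rows from in_file to out_file, keeping the first row per id."""
    reader = csv.reader(in_file, delimiter=delimiter)
    writer = csv.writer(out_file, delimiter=delimiter, lineterminator='\n')
    ids_seen = set()
    for row in reader:
        if not row:  # skip blank lines
            continue
        if row[0] not in ids_seen:
            ids_seen.add(row[0])
            writer.writerow(row)


# Invented, shortened sample in the same shape as the question's data
source = io.StringIO(
    "id1|x|2017-01-20 11:42:12\n"
    "id2|y|2017-01-20 11:42:12\n"
    "id1|x|2017-01-20 11:43:28\n"
)
target = io.StringIO()
write_unique_by_id(source, target)
unique_text = target.getvalue()
```

With real files, both would be opened with `newline=''`, as the csv module documentation recommends, and passed to the function in place of the `io.StringIO` objects.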

Answer 4 (score: 1)

Try this:

import csv
data = []  # unique rows, in first-seen order
with open('results/20_01_2017_db_file.csv', newline='') as f:
    reader = csv.reader(f, delimiter='|')  # the file is pipe-delimited
    for row in reader:
        if row not in data:
            data.append(row)

Answer 5 (score: 0)

You can generate a new file containing only the unique rows using just a list:

def unique(input_file_path, output_file_path):
    unique_ids = []
    with open(input_file_path) as in_file, open(output_file_path, 'w') as out_file:
        for line in in_file:
            tokens = line.split('|',1)
            if tokens[0] not in unique_ids:
                unique_ids.append(tokens[0])
                out_file.write(line)

Call it like this:

unique('path/to/input','path/to/output')
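
A list works here, but `tokens[0] not in unique_ids` scans the whole list for every line, which can get slow on large files. A possible variant is sketched below with a set, whose membership checks take constant time on average. The name `unique_with_set`, the temporary files, and the sample lines are assumptions for the demo, not part of the answer:

```python
import os
import tempfile


def unique_with_set(input_file_path, output_file_path):
    """Write each line whose first '|'-separated field has not been seen yet."""
    seen_ids = set()
    with open(input_file_path) as in_file, open(output_file_path, 'w') as out_file:
        for line in in_file:
            first_id = line.split('|', 1)[0]
            if first_id not in seen_ids:
                seen_ids.add(first_id)
                out_file.write(line)


# Demo in a throwaway directory with invented sample lines
with tempfile.TemporaryDirectory() as tmp_dir:
    src = os.path.join(tmp_dir, 'input.csv')
    dst = os.path.join(tmp_dir, 'output.csv')
    with open(src, 'w') as f:
        f.write("id1|x|11:42:12\nid2|y|11:42:12\nid1|x|11:43:28\n")
    unique_with_set(src, dst)
    with open(dst) as f:
        kept = f.read()
```

The output order is the same as with the list version, because lines are still written as they are read; only the bookkeeping structure changes.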