I have a CSV file here:
b5711586dc018c1deed6b1ea596da304|f4e3945da368711abb3110b621ceada5c21c11f8|bdf7f718f579d64060c7739225de573e4ffda7fe8b10cdaaeb672de5b7c06 98e|2017-01-20 11:42:12|111|Relative|path
1beb1d0ac2d24cb87d8fe6ce05601136|f5ace00777f68909d106719629c85fb3af23b810|62f6ebb14ede7a1b6307cea5f58a18ff59282650af750a575d1bdb530c04f 11f|2017-01-20 11:42:12|111|Relative|path
b5711586dc018c1deed6b1ea596da304|f4e3945da368711abb3110b621ceada5c21c11f8|bdf7f718f579d64060c7739225de573e4ffda7fe8b10cdaaeb672de5b7c06 98e|2017-01-20 11:43:28|111|Relative|path
1beb1d0ac2d24cb87d8fe6ce05601136|f5ace00777f68909d106719629c85fb3af23b810|62f6ebb14ede7a1b6307cea5f58a18ff59282650af750a575d1bdb530c04f 11f|2017-01-20 11:43:28|111|Relative|path
b5711586dc018c1deed6b1ea596da304|f4e3945da368711abb3110b621ceada5c21c11f8|bdf7f718f579d64060c7739225de573e4ffda7fe8b10cdaaeb672de5b7c06 98e|2017-01-20 11:48:03|111|Relative|path
1beb1d0ac2d24cb87d8fe6ce05601136|f5ace00777f68909d106719629c85fb3af23b810|62f6ebb14ede7a1b6307cea5f58a18ff59282650af750a575d1bdb530c04f 11f|2017-01-20 11:48:03|111|Relative|path
But I want to remove the duplicate rows and keep only the unique ones.
Is there a way to write a Python script to do this? I used the following script:
import csv
with open('results/20_01_2017_db_file.csv','rb') as f:
    reader = csv.reader(f)
    for row in reader:
        print ', '.join(row)
Answer 0 (score: 2)
with open('results/20_01_2017_db_file.csv','r') as in_file, open('results/20_01_2017_db_unique_file.csv','w') as out_file:
    seen = set()  # lines already written to the output file
    for line in in_file:
        if line in seen:
            continue  # skip a line that was already written
        seen.add(line)
        out_file.write(line)
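One thing to watch with the sample rows in the question: they differ in their timestamp field, so a whole-line comparison treats all six as distinct. Below is a minimal sketch, assuming Python 3 and assuming the timestamp (the fourth field) should be ignored when comparing. The function name and the `ignored_index` parameter are made up for illustration, and the sample rows are shortened placeholders rather than the real hashes.

```python
def drop_duplicates_ignoring_column(lines, ignored_index=3, delimiter="|"):
    """Yield each line whose fields, minus one ignored column, are new."""
    seen = set()
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        # Build the comparison key from every field except the ignored one.
        key = tuple(f for i, f in enumerate(fields) if i != ignored_index)
        if key in seen:
            continue
        seen.add(key)
        yield line


# Shortened placeholder rows with the same layout as the question's data.
sample = [
    "aaa|bbb|ccc|2017-01-20 11:42:12|111|Relative|path\n",
    "ddd|eee|fff|2017-01-20 11:42:12|111|Relative|path\n",
    "aaa|bbb|ccc|2017-01-20 11:43:28|111|Relative|path\n",
    "ddd|eee|fff|2017-01-20 11:43:28|111|Relative|path\n",
]
unique_lines = list(drop_duplicates_ignoring_column(sample))
print("".join(unique_lines), end="")
```

Only the first occurrence of each row is kept, so the earliest timestamp survives. If the latest one is wanted instead, the key would have to map to the most recent line rather than being stored in a plain set.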
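Answer 1 (score: 2)
Instead of parsing the rows as delimited fields, you can read them as plain lines and put them into a set, which keeps one copy of each distinct line.
This should work for you: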
with open('results/20_01_2017_db_file.csv','rb') as f:
    line_set = set(f)
with open('results/20_01_2017_db_file_v2.csv', 'wb') as f:
    for line in line_set:
        f.write(line)
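Note that a set does not keep the original order, so the output file may come out shuffled. If the order matters, one option is to use dictionary keys instead. This is only a sketch, assuming Python 3.7 or later (where dictionaries preserve insertion order); the helper name `unique_in_order` and the sample lines are invented for the example.

```python
def unique_in_order(lines):
    """Return the distinct lines, keeping the position of each first occurrence."""
    # Dictionary keys are unique and, from Python 3.7 on, keep insertion order.
    return list(dict.fromkeys(lines))


lines = ["a|1\n", "b|2\n", "a|1\n", "c|3\n", "b|2\n"]
result = unique_in_order(lines)
print(result)
```

With a real file, the open file object can be passed in directly, since iterating over it yields the lines.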
Answer 2 (score: 1)
Do it like this:
import csv
new_rows = set()
with open('results/20_01_2017_db_file.csv','rb') as f:
    reader = csv.reader(f, delimiter='|')
    for row in reader:
        new_rows.add(tuple(row))  # lists are not hashable, tuples are
with open('results/20_01_2017_db_fileUniq.csv', 'wb') as fout:
    writer = csv.writer(fout, delimiter='|')
    writer.writerows(new_rows)  # file objects have no writeline() method
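Answer 3 (score: 1)
Use a set to remember the rows already seen (keyed here by their first field), and print only those that have not appeared yet: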
import csv
with open('a.csv','rb') as f:
    reader = csv.reader(f, delimiter='|') # need to specify delimiter
    rows_seen = set()
    for row in reader:
        row_key = row[0]
        if row_key not in rows_seen:
            print ', '.join(row)
            rows_seen.add(row_key)
Also note that you need to specify the delimiter (|) explicitly, because it is not the default delimiter.
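To write the surviving rows to a file instead of printing them, the same idea can be combined with `csv.writer`. This is a rough sketch, assuming Python 3; the function name `write_first_rows_per_key`, the `key_index` parameter, and the in-memory sample data are all made up for the example.

```python
import csv
import io


def write_first_rows_per_key(in_file, out_file, key_index=0, delimiter="|"):
    """Copy rows to out_file, keeping only the first row seen for each key."""
    reader = csv.reader(in_file, delimiter=delimiter)
    writer = csv.writer(out_file, delimiter=delimiter, lineterminator="\n")
    rows_seen = set()
    for row in reader:
        if not row:
            continue  # skip blank lines
        key = row[key_index]
        if key not in rows_seen:
            rows_seen.add(key)
            writer.writerow(row)


# In-memory stand-ins for the input and output files.
source = io.StringIO(
    "k1|x|2017-01-20 11:42:12\n"
    "k2|y|2017-01-20 11:42:12\n"
    "k1|x|2017-01-20 11:43:28\n"
)
target = io.StringIO()
write_first_rows_per_key(source, target)
output = target.getvalue()
print(output, end="")
```

With real files on Python 3, open them in text mode with `newline=''`, as the `csv` module documentation recommends, rather than in binary mode.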
Answer 4 (score: 1)
Try this:
import csv
data = []
with open('results/20_01_2017_db_file.csv','rb') as f:
    reader = csv.reader(f)
    for row in reader:
        if row not in data:
            data.append(row)
Answer 5 (score: 0)
You can generate a new file containing only the unique rows using nothing more than a list:
def unique(input_file_path, output_file_path):
    unique_ids = []
    with open(input_file_path) as in_file, open(output_file_path, 'w') as out_file:
        for line in in_file:
            tokens = line.split('|',1)
            if tokens[0] not in unique_ids:
                unique_ids.append(tokens[0])
                out_file.write(line)
Call it like this:
unique('path/to/input','path/to/output')
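A membership test on a list scans every stored element, so this can get slow on large files. Swapping the list for a set keeps the logic the same while making each lookup roughly constant time. Here is a sketch, assuming Python 3; the name `unique_by_first_field` and the temporary files are only for demonstration.

```python
import os
import tempfile


def unique_by_first_field(input_file_path, output_file_path, delimiter="|"):
    """Write each line whose first field has not been seen before."""
    seen_ids = set()
    with open(input_file_path) as in_file, open(output_file_path, "w") as out_file:
        for line in in_file:
            first_field = line.split(delimiter, 1)[0]
            if first_field not in seen_ids:
                seen_ids.add(first_field)
                out_file.write(line)


# Demonstration on throwaway files with made-up contents.
with tempfile.TemporaryDirectory() as tmp_dir:
    in_path = os.path.join(tmp_dir, "input.csv")
    out_path = os.path.join(tmp_dir, "output.csv")
    with open(in_path, "w") as f:
        f.write("id1|a|t1\nid2|b|t1\nid1|a|t2\n")
    unique_by_first_field(in_path, out_path)
    with open(out_path) as f:
        deduplicated = f.read()
print(deduplicated, end="")
```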