I am looking to identify duplicates in a raw text file, and once a duplicate is identified, I want to ignore it when creating the new CSV file.
raw_file_reader = csv.DictReader(open(raw_file), delimiter='|')
Note that my raw file is a simple .txt file.
with open('file') as f:
    seen = set()
    for line in f:
        line_lower = line.lower()
        if line_lower in seen:
            print(line)
        else:
            seen.add(line_lower)
I can find the duplicates using a set, as shown above. The rest of my code looks like this:
for row in raw_file_reader:
    if ('Symbol' in row):
        symbol = row['Symbol']
    elif ('SYMBOL' in row):
        symbol = row['SYMBOL']
    else:
        raise Exception('Symbol column not found')
    if symbol not in symbol_lookup:
        continue
I am just unsure how to actually ignore the duplicates before converting to the CSV file.
Answer 0 (score: 0)
I would use the csv library to do this. In addition, there is a built-in way to enumerate items, so let's use that:
import csv
with open("in.txt","r") as fi, open("out.csv","w") as fo:
writer = csv.writer(fo, lineterminator='\n')
writer.writerows(enumerate(set(fi.read().split("|"))))
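Note that fi.read().split("|") splits the whole file on the delimiter, so newline characters stay embedded inside the fields and the set de-duplicates individual fields rather than complete lines. For instance, if in.txt held the single line a|b|a|c, out.csv would receive three numbered rows covering a, b and c, in arbitrary set order.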
Answer 1 (score: 0)
You can remove the duplicates by storing all of the entries in a set, as follows:
import csv
seen = set()
output = []
source_file = "file.csv"
with open(source_file, 'rb') as f_input:
    csv_input = csv.reader(f_input, delimiter='|')
    for row in csv_input:
        if tuple(row) not in seen:
            output.append(row)
            seen.add(tuple(row))

with open(source_file, 'wb') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerows(output)
Giving you an output file containing:
20100830,TECD,1500,4300,N
20100830,TECH,100,100,N
20100830,TECUA,100,391,N
20100830,TEF,1300,1300,N
20100830,TEG,900,1900,N
This works by converting each complete row into a tuple, which can then be stored in a set. That makes it straightforward to test for duplicate rows.
Tested on Python 2.7.12.
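The tuple conversion is the key step: csv.reader yields each row as a list, and lists are unhashable, so they cannot be placed in a set directly. A minimal sketch, using a row from the sample output above:
row = ['20100830', 'TECD', '1500', '4300', 'N']
seen = set()
# seen.add(row)  # would fail: TypeError: unhashable type: 'list'
seen.add(tuple(row))       # tuples are immutable and hashable
print(tuple(row) in seen)  # True, so repeated rows are easy to detect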
Answer 2 (score: 0)
You can simply create a custom iterator that returns the original file's lines with duplicates removed:
class Dedup:
    def __init__(self, fd):
        self.fd = fd       # store the original file object
        self.seen = set()  # initialize an empty set of already-seen lines
    def __next__(self):    # the iterator method
        while True:
            line = next(self.fd)
            if line not in self.seen:
                self.seen.add(line)
                return line
            # print("DUP>", line.strip(), "<")  # uncomment for tests
    def __iter__(self):    # make the iterator compatible with Python 2 and 3
        return self
    def next(self):
        return self.__next__()
    def __enter__(self):   # make it a context manager supporting with
        return self
    def __exit__(self, typ, value, traceback):
        self.fd.close()    # cleanup
Then you can simply create your DictReader:
with Dedup(open(raw_file)) as fd:
    reader = csv.DictReader(fd, delimiter='|')
    for row in reader:
        ...  # process each now-unique row
But beware! This requires all lines to be stored in the set, which means the original file must fit into memory.
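If memory is a concern, a possible variation (my own sketch, not part of the answer above; it assumes Python 3, where lines are str and must be encoded before hashing) is to keep a fixed-size digest of each line instead of the line itself:
import hashlib

class DedupByDigest(Dedup):
    # Same behaviour as Dedup, but the set stores 16-byte MD5 digests
    # rather than full lines, so each entry has a bounded size.
    def __next__(self):
        while True:
            line = next(self.fd)
            key = hashlib.md5(line.encode('utf-8')).digest()
            if key not in self.seen:
                self.seen.add(key)
                return line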