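I have a question about removing duplicates in Python. I have read a lot of posts but have not been able to solve it yet. I have the following CSV file:
EDIT
Input: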
ID, Source, 1.A, 1.B, 1.C, 1.D
1, ESPN, 5,7,,,M
1, NY Times,,10,12,W
1, ESPN, 10,,Q,,M
The output should be:
ID, Source, 1.A, 1.B, 1.C, 1.D, duplicate_flag
1, ESPN, 5,7,,,M, duplicate
1, NY Times,,10,12,W, duplicate
1, ESPN, 10,,Q,,M, duplicate
1, NY Times, 5 (or 10 doesn't matter which one),7, 10, 12, W, not_duplicate
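In words: if the ID is the same, take the values from the row whose Source is "NY Times". If a cell in the "NY Times" row is empty and the duplicate row from the "ESPN" source has a value for that cell, take the value from the "ESPN" row. For the output, flag the original rows as duplicates and create an additional merged row.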
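To make the rule concrete, here is a minimal sketch of the per-cell fallback for just two rows. The helper name `merge_cells` and the six-column sample values are illustrative assumptions (tidied versions of the rows above), not part of my actual script:

```python
def merge_cells(preferred, fallback):
    """Take each cell from `preferred`; use `fallback` where it is empty."""
    return [p if p != "" else f for p, f in zip(preferred, fallback)]


# Hypothetical, cleaned-up rows with the same number of columns.
ny_times = ["1", "NY Times", "", "10", "12", "W"]
espn = ["1", "ESPN", "5", "7", "", "M"]

merged = merge_cells(ny_times, espn)
print(merged)  # ['1', 'NY Times', '5', '10', '12', 'W']
```

Only the empty 1.A cell is filled from the ESPN row; every cell that the NY Times row already has is kept.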
To clarify further: since I need to run this script on many different CSV files with different column headers, I can't do something like this:
import csv

def main():
    with open(input_csv, newline="") as infile:
        input_fields = ("ID", "Source", "1.A", "1.B", "1.C", "1.D")
        reader = csv.DictReader(infile, fieldnames=input_fields)
        with open(output_csv, "w", newline="") as outfile:
            output_fields = ("ID", "Source", "1.A", "1.B", "1.C", "1.D", "d_flag")
            writer = csv.DictWriter(outfile, fieldnames=output_fields)
            writer.writeheader()
            next(reader)  # skip the header row of the input file
            first_row = next(reader)
            for next_row in reader:
                pass  # stuff
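That is because I want the program to operate on the first two columns regardless of the other columns in the table. In other words, "ID" and "Source" will be present in every input file, but the remaining columns will change from file to file.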
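Thanks a lot for any help you can provide! FYI, "Source" can only be NY Times, ESPN, or Wall Street Journal, and the order of priority for duplicates is: take NY Times if available, otherwise ESPN, otherwise Wall Street Journal. This holds for every input file.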
Answer 0 (score: 2)
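The code below reads all of the records into one big dictionary whose keys are the record identifiers and whose values are dictionaries mapping source names to entire data rows. It then iterates over the dictionary and produces the output you asked for.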
import csv

header = None
idfld = None
sourcefld = None
record_table = {}

with open('input.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        row = [x.strip() for x in row]
        if header is None:
            # First row: remember the header and locate the two fixed columns.
            header = row
            for i, fld in enumerate(header):
                if fld == 'ID':
                    idfld = i
                elif fld == 'Source':
                    sourcefld = i
            continue
        key = row[idfld]
        sourcename = row[sourcefld]
        if key not in record_table:
            # Store a copy per source so later merging leaves "all_rows" untouched.
            record_table[key] = {sourcename: list(row), "all_rows": [row]}
        else:
            if sourcename in record_table[key]:
                # Same ID and same source: fill empty cells from the new row.
                cur_row = record_table[key][sourcename]
                for i, fld in enumerate(row):
                    if cur_row[i] == '':
                        cur_row[i] = fld
            else:
                record_table[key][sourcename] = list(row)
            record_table[key]["all_rows"].append(row)

print(', '.join(header) + ', duplicate_flag')

for recordid in record_table:
    rowdict = record_table[recordid]
    final_row = [''] * len(header)
    # Count the original rows, not the dictionary keys ("all_rows" is a key too).
    rowcount = len(rowdict["all_rows"])
    # Walk the sources in priority order; the first non-empty value wins.
    for sourcetype in ['NY Times', 'ESPN', 'Wall Street Journal']:
        if sourcetype in rowdict:
            row = rowdict[sourcetype]
            for i, fld in enumerate(row):
                if final_row[i] != '':
                    continue
                if fld != '':
                    final_row[i] = fld
    if rowcount > 1:
        for row in rowdict["all_rows"]:
            print(', '.join(row) + ', duplicate')
    print(', '.join(final_row) + ', not_duplicate')
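As a possible variation, here is a sketch that wraps the same idea in a function and writes real CSV through `csv.writer` instead of printing. It looks up "ID" and "Source" in whatever header the file has, so the other columns can vary. The names `flag_duplicates` and `PRIORITY` and the four-column sample are my own illustrative choices, and the sketch assumes Python 3.7+ and that every Source value is one of the three you listed (an unknown source would raise `ValueError`):

```python
import csv
import io

# Priority order taken from the question.
PRIORITY = ["NY Times", "ESPN", "Wall Street Journal"]


def flag_duplicates(infile, outfile, priority=PRIORITY):
    """Copy rows from infile to outfile, adding a duplicate_flag column."""
    reader = csv.reader(infile, skipinitialspace=True)
    writer = csv.writer(outfile)
    header = [h.strip() for h in next(reader)]
    id_idx, src_idx = header.index("ID"), header.index("Source")
    writer.writerow(header + ["duplicate_flag"])

    groups = {}  # ID -> list of rows, in input order
    for row in reader:
        row = [cell.strip() for cell in row]
        groups.setdefault(row[id_idx], []).append(row)

    for rows in groups.values():
        if len(rows) == 1:
            writer.writerow(rows[0] + ["not_duplicate"])
            continue
        for row in rows:
            writer.writerow(row + ["duplicate"])
        # Stable sort: preferred sources first, input order kept within a source.
        ranked = sorted(rows, key=lambda r: priority.index(r[src_idx]))
        merged = [""] * len(header)
        for row in ranked:
            for i, cell in enumerate(row[:len(header)]):
                if merged[i] == "" and cell != "":
                    merged[i] = cell
        writer.writerow(merged + ["not_duplicate"])


# Hypothetical sample with only two data columns.
sample = "ID,Source,1.A,1.B\n1,ESPN,5,7\n1,NY Times,,10\n2,ESPN,3,\n"
out = io.StringIO()
flag_duplicates(io.StringIO(sample), out)
print(out.getvalue())
```

For real files you would pass handles opened with `open(path, newline="")` and `open(path, "w", newline="")` in place of the `io.StringIO` objects.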