Marking duplicates in a CSV file

Time: 2009-11-14 03:42:42

Tags: python csv duplicates

I'm having trouble solving the problem illustrated by the example below:

"ID","NAME","PHONE","REF","DISCARD"
1,"JOHN",12345,,
2,"PETER",6232,,
3,"JON",12345,,
4,"PETERSON",6232,,
5,"ALEX",7854,,
6,"JON",12345,,

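I want to detect duplicates in the "PHONE" column and mark the subsequent duplicates using the "REF" column, whose value should point to the "ID" of the first item, and put the value "Yes" in the "DISCARD" column: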

"ID","NAME","PHONE","REF","DISCARD"
1,"JOHN",12345,1,
2,"PETER",6232,2,
3,"JON",12345,1,"Yes"
4,"PETERSON",6232,2,"Yes"
5,"ALEX",7854,,
6,"JON",12345,1,"Yes"

So, how can I do this? I tried the code below, but of course my logic is incorrect.

import csv
myfile = open("C:\Users\Eduardo\Documents\TEST2.csv", "rb")
myfile1 = open("C:\Users\Eduardo\Documents\TEST2.csv", "rb")

dest = csv.writer(open("C:\Users\Eduardo\Documents\TESTFIXED.csv", "wb"), dialect="excel")

reader = csv.reader(myfile)
verum = list(reader)
verum.sort(key=lambda x: x[2])
for i, row in enumerate(verum):
    if row[2] == verum[i][2]:
        verum[i][3] = row[0]

print verum

Your guidance and help would be much appreciated.

5 answers:

Answer 0: (score: 7)

The only thing you have to keep in memory while this runs is a mapping from phone numbers to their IDs.

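import csv

# phone number -> ID of the first row that used it
# (renamed from "map" so the builtin of that name is not shadowed)
phone_to_id = {}
with open(r'c:\temp\input.csv', 'r') as fin:
    reader = csv.reader(fin)
    with open(r'c:\temp\output.csv', 'w') as fout:
        writer = csv.writer(fout)
        # omit this if the file has no header row
        writer.writerow(next(reader))
        for row in reader:
            (id, name, phone, ref, discard) = row
            if phone in phone_to_id:
                # later occurrence: point at the first ID and flag it
                ref = phone_to_id[phone]
                discard = "Yes"
            else:
                phone_to_id[phone] = id
            writer.writerow((id, name, phone, ref, discard))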
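
Note that a single pass like the one above leaves REF empty on the first occurrence, whereas the expected output in the question also fills REF on the first row of a repeated number. A minimal two-pass sketch of that variant follows, assuming Python 3 and the five-column layout from the question. The function name `mark_duplicates` and the in-memory `SAMPLE` text are illustrative, not part of the original answer.

```python
import csv
import io
from collections import Counter

SAMPLE = '''"ID","NAME","PHONE","REF","DISCARD"
1,"JOHN",12345,,
2,"PETER",6232,,
3,"JON",12345,,
4,"PETERSON",6232,,
5,"ALEX",7854,,
6,"JON",12345,,
'''

def mark_duplicates(text):
    """Return CSV text with REF/DISCARD filled in (hypothetical helper)."""
    # Pass 1: count how often each phone number occurs.
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    counts = Counter(row[2] for row in reader)

    # Pass 2: rewrite the rows, remembering the first ID per repeated number.
    reader = csv.reader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(next(reader))  # copy the header through
    first_id = {}
    for row in reader:
        row_id, phone = row[0], row[2]
        if counts[phone] > 1:
            if phone in first_id:
                # later occurrence: reference the first ID and flag it
                row[3], row[4] = first_id[phone], "Yes"
            else:
                # first occurrence of a repeated number points at itself
                first_id[phone] = row_id
                row[3] = row_id
        writer.writerow(row)
    return out.getvalue()

print(mark_duplicates(SAMPLE))
```

Only the counts and the first-ID mapping are held between passes. With the default quoting of `csv.writer`, the name fields are written without the surrounding quotes shown in the question.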

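Answer 1: (score: 0)

Sounds like homework. Since this is a CSV file (so changing the size of a record in place is nearly impossible), you are best off loading the whole file into memory and manipulating it there before writing it out to a new file. Create a list of strings holding the original lines of the file. Then create a map whose keys are phone numbers and whose values are IDs. Before each insert, look up the number: if it already exists, update the line containing the duplicate phone number. If it is not in the map yet, insert the (phone, id) pair.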
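
The in-memory approach described above might look roughly like this. It is a sketch assuming Python 3, a header row, and the column order from the question. The name `flag_duplicates_in_memory` and the shortened sample data are made up for illustration.

```python
import csv
import io

SAMPLE = '''"ID","NAME","PHONE","REF","DISCARD"
1,"JOHN",12345,,
2,"PETER",6232,,
3,"JON",12345,,
'''

def flag_duplicates_in_memory(text):
    """Load all rows, then update duplicates in place (hypothetical helper)."""
    rows = list(csv.reader(io.StringIO(text)))  # the whole file as a list of rows
    seen = {}  # phone number (key) -> ID of the first row with that number (value)
    for row in rows[1:]:  # rows[0] is the header
        phone = row[2]
        if phone in seen:
            # duplicate: update this row to reference the first ID
            row[3] = seen[phone]
            row[4] = "Yes"
        else:
            # first sighting: insert the (phone, id) pair
            seen[phone] = row[0]
    return rows

for row in flag_duplicates_in_memory(SAMPLE):
    print(row)
```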

Answer 2: (score: 0)

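from operator import itemgetter
from itertools import groupby

import csv
# a csv.reader has no sort() method, so read the rows into a list first
# (assumes data.csv has no header row, as in the output shown)
verum = list(csv.reader(open('data.csv','rb')))

# sort by phone number, then by numeric id, so groupby sees equal numbers together
verum.sort(key=lambda r: (r[2], int(r[0])))
def grouper( verum ):
    for key, grp in groupby(verum,itemgetter(2)):
        # key = phone number, grp = records with that number
        first = next(grp)
        # first item gets its id written into the 4th column
        yield [first[0],first[1],first[2],first[0],''] #or list(itemgetter(0,1,2,0,4)(first)) 
        for x in grp:
            # all others get the first items id as ref
            yield [x[0],x[1],x[2], first[0], "Yes"]

# restore the original order by numeric id before printing
for line in sorted(grouper(verum), key=lambda r: int(r[0])):
    print line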

Output:

['1', 'JOHN', '12345', '1', '']
['2', 'PETER', '6232', '2', '']
['3', 'JON', '12345', '1', 'Yes']
['4', 'PETERSON', '6232', '2', 'Yes']
['5', 'ALEX', '7854', '5', '']
['6', 'JON', '12345', '1', 'Yes']

Writing the data back is left to the reader ;-)
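
For completeness, writing the rows back could be sketched as follows. This assumes Python 3 and uses an in-memory buffer so the example stands alone. The helper name `rows_to_csv` is illustrative only.

```python
import csv
import io

def rows_to_csv(header, rows):
    """Serialize a header plus data rows to CSV text (hypothetical helper)."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)  # accepts any iterable of rows, including a generator
    return out.getvalue()

header = ["ID", "NAME", "PHONE", "REF", "DISCARD"]
rows = [["1", "JOHN", "12345", "1", ""],
        ["3", "JON", "12345", "1", "Yes"]]
print(rows_to_csv(header, rows))
```

To write to a real file in Python 3, the same `csv.writer` calls can be pointed at a file opened with `newline=""`.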

Answer 3: (score: 0)

I know one thing: you don't have to read the whole file into memory to accomplish this.

import csv
# raw strings keep the backslashes in the Windows paths literal
myfile = r"C:\Users\Eduardo\Documents\TEST2.csv"

dest = csv.writer(open(r"C:\Users\Eduardo\Documents\TESTFIXED.csv", "wb"), dialect="excel")

phonedict = {}

reader = csv.reader(open(myfile, "rb"))
# copy the header row through unchanged
dest.writerow(next(reader))
for row in reader:
    # setdefault sets the value to the second argument if it hasn't been set, and then
    # returns what the value in the dictionary is.
    firstid = phonedict.setdefault(row[2], row[0])
    row[3] = firstid
    # compare by value; "is not" tests object identity and is unreliable for strings
    if firstid != row[0]:
        row[4] = "Yes"
    dest.writerow(row)
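
To make the `setdefault` step concrete, here is a small stand-alone sketch, assuming Python 3. The helper `first_ids` and its (id, phone) input pairs are invented for illustration. The comparison is by value (`!=`); identity checks such as `is not` on strings depend on interning and are not reliable here.

```python
def first_ids(pairs):
    """For each (id, phone) pair, report the first ID seen for that phone."""
    phonedict = {}
    result = []
    for row_id, phone in pairs:
        # stores row_id only if phone is not yet a key; always returns the stored value
        firstid = phonedict.setdefault(phone, row_id)
        is_duplicate = firstid != row_id
        result.append((row_id, firstid, is_duplicate))
    return result

for entry in first_ids([("1", "12345"), ("2", "6232"), ("3", "12345")]):
    print(entry)
```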

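Answer 4: (score: 0)

I work with large CSV files of 40k+ records, and the easiest way to get rid of dupes is with Access.

1. Create a new database.
2. On the Tables tab, use Get External Data.
3. Save the table.
4. On the Queries tab, choose New and run the find-duplicates wizard (match on the phone field, show all fields and the count).
5. Save the query (export as .txt, but name it dupes.txt).
6. Import the query result as a new table; do not import the field with the duplicate count.
7. Run a find-unmatched query (match on the phone field, show all fields in the result). Save the query, then export as .txt, but name it unique.txt.
8. Import the unique file into the existing table (dupes).
9. You can now save and export again to whatever file type you need, without any dupes.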