Extracting unique records from a csv file in Python

Time: 2014-02-15 00:12:39

Tags: python

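I have a csv file from which I want to keep only the unique records. The 4th field of this file contains some text followed by a human or mouse name, like RHPN1_HUMAN and EPHA5_MOUSE.

For example: EPHA5 occurs in both human and mouse, so I want to delete this record; RHPN1 occurs only in human, so I want to keep this record.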
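To make the rule concrete, here is a minimal sketch (not part of the original post) that splits the fourth field on the underscore and groups the species suffixes by name. The helper name `species_by_name` and the inline sample values are illustrative assumptions.

```python
from collections import defaultdict

def species_by_name(fields):
    """Map each name to the set of species suffixes it appears with."""
    groups = defaultdict(set)
    for field in fields:
        # "EPHA5_MOUSE" -> name "EPHA5", species "MOUSE"
        name, _, species = field.partition("_")
        groups[name].add(species)
    return dict(groups)

sample = ["RHPN1_HUMAN", "EPHA5_MOUSE", "EPHA5_HUMAN"]  # hypothetical input
groups = species_by_name(sample)

# Names seen with exactly one species are the ones to keep.
unique_names = [name for name, species in groups.items() if len(species) == 1]
print(unique_names)  # ['RHPN1']
```

Under this reading, a record is kept when its name appears with only one species.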

file1.csv

meNOG00001  9606    ENSP00000289013         RHPN1_HUMAN

meNOG00005  10090   ENSMUSP00000060646  EPHA5_MOUSE

meNOG00005  9606    ENSP00000273854         EPHA5_HUMAN

meNOG00006  10090   ENSMUSP00000082503  RGPA1_MOUSE

meNOG00006  9606    ENSP00000202677         RGPA2_HUMAN

meNOG00006  9606    ENSP00000302647         RGPA1_HUMAN

meNOG00010  9606    ENSP00000253669         HAUS8_HUMAN

meNOG00011  10090   ENSMUSP00000017629  TOP2B_MOUSE

meNOG00011  10090   ENSMUSP00000068896  TOP2A_MOUSE

meNOG00011  9606    ENSP00000396704         TOP2B_HUMAN

meNOG00011  9606    ENSP00000411532         TOP2A_HUMAN

output.csv

meNOG00001  9606    ENSP00000289013         RHPN1_HUMAN

meNOG00006  9606    ENSP00000202677         RGPA2_HUMAN

meNOG00010  9606    ENSP00000253669         HAUS8_HUMAN

I tried, but my code does not work the way I want...

file1 = open("file1.csv", "rU")
reader1 = csv.reader(file1,delimiter=',')

d =[]
c =[]
for row in reader1:
    d.append(row[3].split('_')[0])
d=list(set(d))

for row1 in d:
    for row2 in reader1:
        if row1 == row2[3].split('_')[0]:
               c.append(row2)

    file1.seek(0)

with open('output.csv', 'w') as f_out:
    writer = csv.writer(f_out, delimiter=',')
    for k in c:
        writer.writerow(k)

2 answers:

Answer 0: (score: 1)

import csv
import collections
data = collections.OrderedDict()            # 2
with open("file1.csv", "r") as f:           # the "U" mode flag was removed in Python 3.11
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        key = row[3].split('_')[0]
        if key in data:
            del data[key]                   # 1
        else:
            data[key] = row                 

with open('output.csv', 'w') as f_out:
    writer = csv.writer(f_out, delimiter=',')
    writer.writerows(data.values())
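  1. If a key is seen more than once, the item is removed from the dictionary. This removes duplicates as long as a key can occur at most twice.
  2. Use an OrderedDict to keep the lines in order. If that does not matter to you, you can use a regular dict.

  3. If a key can occur more than twice, you will need a different way to keep track of keys that have already been seen. You could use a set. For example,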

    import csv
    import collections
    seen = set()
    data = collections.OrderedDict()            
    with open("file1.csv", "r") as f:       # the "U" mode flag was removed in Python 3.11
        reader = csv.reader(f, delimiter=',')
        for row in reader:
            key = row[3].split('_')[0]
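            if key in seen:
                # pop() with a default avoids a KeyError when the key was
                # already removed by an earlier repeat (third occurrence onward)
                data.pop(key, None)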
            else:
                data[key] = row                 
                seen.add(key)
    
    with open('output.csv', 'w') as f_out:
        writer = csv.writer(f_out, delimiter=',')
        writer.writerows(data.values())
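As a further illustration of handling keys that occur any number of times, the sketch below counts every key first and then keeps only the rows whose key occurs exactly once. It is an alternative to the approach above, not a restatement of it; the function name `unique_rows` and the in-memory sample rows are assumptions made for the example.

```python
import collections

def unique_rows(rows, column=3):
    """Return rows whose key (text before '_' in `column`) occurs exactly once."""
    rows = list(rows)  # materialize so the rows can be scanned twice
    counts = collections.Counter(row[column].split("_")[0] for row in rows)
    return [row for row in rows if counts[row[column].split("_")[0]] == 1]

# Hypothetical in-memory rows standing in for the parsed csv file.
rows = [
    ["meNOG00001", "9606", "ENSP00000289013", "RHPN1_HUMAN"],
    ["meNOG00005", "10090", "ENSMUSP00000060646", "EPHA5_MOUSE"],
    ["meNOG00005", "9606", "ENSP00000273854", "EPHA5_HUMAN"],
]
result = unique_rows(rows)
print(result)
```

Counting first costs a second pass over the rows, but the result does not depend on how many times a key repeats, and the original row order is kept.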
    

Answer 1: (score: 0)

Not fully tested, but you could use something like the following:

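import csv
from collections import OrderedDict

class OD(OrderedDict):
    coll = set()  # keys seen so far; as a class attribute it is shared by all OD instances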
    def __setitem__(self, key, value):
        if key in self.coll:
            try:
                del self[key]
            except KeyError:
                pass
        else:
            OrderedDict.__setitem__(self, key, value)
            self.coll.add(key)

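The reason is that I am not sure whether you will have more than 2 matches. For instance, if a key matches an odd number of times, code that only checks against the keys currently in the dictionary will not work, because any key with an odd number of occurrences would be treated as unique. The above will work, though. (Although it may be a bit overkill.)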

d = OD()

with open("file1.csv", "r") as f_in:  # the "U" mode flag was removed in Python 3.11
    reader = csv.reader(f_in, delimiter=',')
    for row in reader:
        key = row[3].split('_')[0]
        d[key] = row

with open('output.csv', 'w') as f_out:
    writer = csv.writer(f_out, delimiter=',')
    for val in d.values():
        writer.writerow(val)
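The odd-count concern raised in this answer can be checked with a small sketch. Both function names below, `toggle_dedupe` and `seen_dedupe`, are hypothetical; the first mimics a plain add-or-delete dictionary, the second remembers every key ever seen. The key list is made up for the demonstration.

```python
from collections import OrderedDict

def toggle_dedupe(keys):
    """Add a key if absent, delete it if present (no memory of earlier deletions)."""
    data = OrderedDict()
    for key in keys:
        if key in data:
            del data[key]
        else:
            data[key] = True
    return list(data)

def seen_dedupe(keys):
    """Same idea, but remember every key ever seen so repeats never come back."""
    seen = set()
    data = OrderedDict()
    for key in keys:
        if key in seen:
            data.pop(key, None)  # the key may already be gone; that is fine
        else:
            seen.add(key)
            data[key] = True
    return list(data)

keys = ["TOP2A", "RHPN1", "TOP2A", "TOP2A"]  # "TOP2A" occurs three times
print(toggle_dedupe(keys))  # ['RHPN1', 'TOP2A']
print(seen_dedupe(keys))    # ['RHPN1']
```

With three occurrences, the toggle version re-adds the key on the third pass, while the version with a separate set keeps it out.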