Processing a large CSV file by looking up values from another CSV in Python

Time: 2014-07-28 18:25:26

Tags: python csv

I am parsing a CSV file that looks like this:

SeqID | GI
SeqA 123 456
SeqB 999 888 777
...

What I want to do now involves a second file, which serves as a cross-reference and looks like this:

GI | XID
123 X781
456 X676
789 X123
9999 X217

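The goal is to look up each Seq's GIs in the file that serves as the cross-reference. The problem is that this cross-reference file is very large (2.3 GB). So far I have tried to solve the problem as follows: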

import csv
from collections import defaultdict

def map_GI(gilist, mapping_file):
    with open(gilist, newline='') as infile:
        read_gi = csv.reader(infile)
        GI_list = {rows[0]: rows[1:] for rows in read_gi}  # read GI list into dictionary
    XID_list = defaultdict(list)  # SeqID -> list of XIDs found for it
    with open(mapping_file, newline='') as mapping:  # that's the cross reference
        read_mapping = csv.reader(mapping, delimiter='\t')
        reference_mapping = list(read_mapping)  # load the whole reference into a list
    for k, v in GI_list.items():  # iterate over GI list and mapping file
        for row in reference_mapping:
            if row[0] in v:
                XID_list[k].append(row[1])  # write found XIDs into dictionary
    with open("/output.txt", 'w', newline='') as outfile:  # save mapped SeqIDs plus XIDs
        looked_up_go = csv.writer(outfile, delimiter='\t')
        for key, val in XID_list.items():
            looked_up_go.writerow([key] + val)

The desired output is a file listing the original SeqIDs and the corresponding XIDs:

SeqID | XID
SeqA X781 X676

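The code works, but it takes forever (maybe even longer). Loading the whole cross-reference into a list is not very smart, I know. I found some related questions, but none of them were quite what I was looking for.

I would appreciate any comments and suggestions.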

1 answer:

Answer 0: (score: 0)

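If you have one small file and one large file, the usual answer is to find a way to iterate over the large file once while iterating over the small file repeatedly (reading it into memory if possible, re-reading the file if not), rather than the other way around.

So, start with something like this: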

import csv

def map_GI(gilist, mapping_file):
    with open(gilist, newline='') as infile:
        read_gi = csv.reader(infile)
        GI_list = {rows[0]: rows[1:] for rows in read_gi}  # read GI list into dictionary
    with open(mapping_file, newline='') as mapping:  # that's the cross reference
        read_mapping = csv.reader(mapping, delimiter='\t')
        for gid, xid in read_mapping:  # single pass over the large file
            for seqid, gi_gids in GI_list.items():
                # replace each matching GI with its XID; unmatched GIs stay as they are
                GI_list[seqid] = [xid if gi_gid == gid else gi_gid for gi_gid in gi_gids]
    with open("/output.txt", 'w', newline='') as outfile:  # save mapped SeqIDs plus XIDs
        looked_up_go = csv.writer(outfile, delimiter='\t')
        for key, val in GI_list.items():
            looked_up_go.writerow([key] + val)

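However, if the small file is small enough, you can do better: just build a reverse mapping, so that you can look up which rows need to be modified instead of walking the whole list: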

    # requires: import csv; from collections import defaultdict
    with open(gilist, newline='') as infile:
        read_gi = csv.reader(infile)
        GI_list = {rows[0]: rows[1:] for rows in read_gi}  # read GI list into dictionary
        revdict = defaultdict(list)  # GI -> SeqIDs whose rows contain that GI
        for seqid, gids in GI_list.items():
            for gid in gids:
                revdict[gid].append(seqid)
    with open(mapping_file, newline='') as mapping:  # that's the cross reference
        read_mapping = csv.reader(mapping, delimiter='\t')
        for gid, xid in read_mapping:
            # .get() avoids adding an empty entry to revdict for every unknown GI
            for seqid in revdict.get(gid, ()):
                GI_list[seqid] = [(xid if gi_gid == gid else gi_gid)
                                  for gi_gid in GI_list[seqid]]
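
For reference, here is a minimal self-contained sketch of the reverse-mapping idea that writes the output in the shape the question asks for (SeqID followed only by the XIDs that were found). The function name `map_gi_to_xid` and the `output_path` parameter are made up for illustration, and it assumes, as the code above does, a comma-separated GI file, a tab-separated cross-reference file, and no header rows. XIDs come out in the order they appear in the cross-reference file.

```python
import csv
from collections import defaultdict


def map_gi_to_xid(gi_path, mapping_path, output_path):
    """Write one tab-separated row per SeqID: the SeqID, then its XIDs."""
    seqids_by_gi = defaultdict(list)  # reverse index: GI -> SeqIDs listing it
    xids_by_seq = {}                  # SeqID -> XIDs found so far

    # Small file: read it fully and build the reverse index.
    with open(gi_path, newline='') as infile:
        for row in csv.reader(infile):
            if not row:
                continue  # skip blank lines
            seqid, gis = row[0], row[1:]
            xids_by_seq.setdefault(seqid, [])
            for gi in gis:
                seqids_by_gi[gi].append(seqid)

    # Large file: stream it exactly once, one dict lookup per row.
    with open(mapping_path, newline='') as mapping:
        for row in csv.reader(mapping, delimiter='\t'):
            if len(row) < 2:
                continue  # skip malformed or blank lines
            gi, xid = row[0], row[1]
            # .get() keeps the index from growing with GIs nobody asked about
            for seqid in seqids_by_gi.get(gi, ()):
                xids_by_seq[seqid].append(xid)

    with open(output_path, 'w', newline='') as outfile:
        writer = csv.writer(outfile, delimiter='\t')
        for seqid, xids in xids_by_seq.items():
            writer.writerow([seqid] + xids)
```

Memory use here depends on the size of the small file and the number of matches, not on the 2.3 GB cross-reference, which is only ever held one row at a time.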

(In fact, even if the small file is too large to fit in memory, the same strategy works with a dbm database in place of the dict revdict.)
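
A rough sketch of what the dbm variant might look like, using the standard-library `dbm` module. The helper names `build_reverse_index` and `iter_matches` are hypothetical, and because dbm stores only bytes, this version joins the SeqIDs for each GI with commas, which assumes SeqIDs never contain a comma. The file formats are assumed to be the same as above.

```python
import csv
import dbm


def build_reverse_index(gi_path, db_path):
    """Store GI -> comma-joined SeqIDs in an on-disk dbm database."""
    with dbm.open(db_path, 'n') as db, open(gi_path, newline='') as infile:
        for row in csv.reader(infile):
            if not row:
                continue  # skip blank lines
            seqid = row[0].encode('utf-8')
            for gi in row[1:]:
                key = gi.encode('utf-8')
                try:
                    # GI already seen: append this SeqID to the stored value
                    db[key] = db[key] + b',' + seqid
                except KeyError:
                    db[key] = seqid


def iter_matches(mapping_path, db_path):
    """Stream the large file once, yielding (SeqID, XID) pairs."""
    with dbm.open(db_path, 'r') as db, open(mapping_path, newline='') as mapping:
        for row in csv.reader(mapping, delimiter='\t'):
            if len(row) < 2:
                continue  # skip malformed or blank lines
            try:
                seqids = db[row[0].encode('utf-8')]
            except KeyError:
                continue  # this GI is not referenced by any Seq
            for seqid in seqids.decode('utf-8').split(','):
                yield seqid, row[1]
```

The pairs come out in cross-reference order rather than grouped by SeqID, so producing one output row per SeqID would still need a grouping step, for example a second dbm keyed by SeqID or an external sort.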