I am parsing a CSV file. I have a CSV file that looks like this:
SeqID | GI
SeqA 123 456
SeqB 999 888 777
...
What I want to do now is bring in a second file, which serves as a cross reference. It looks like this:
GI | XID
123 X781
456 X676
789 X123
9999 X217
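The aim is to look up every Seq's GIs in the file that serves as the cross reference. The problem is that this cross-reference file is very large (2.3 GB). So far I have tried to solve the problem as follows: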
import csv
from collections import defaultdict

def map_GI(gilist, mapping_file):
    with open(gilist) as infile:
        read_gi = csv.reader(infile)
        GI_list = {rows[0]: rows[1:] for rows in read_gi}  # read GI list into dictionary
    XID_list = defaultdict(list)  # set up XID list as empty dictionary of lists
    with open(mapping_file) as mapping:  # that's the cross reference
        read_mapping = csv.reader(mapping, delimiter='\t')
        reference_mapping = list(read_mapping)  # write reference in list
    for k, v in GI_list.items():  # iterate over GI list and mapping file
        for row in reference_mapping:
            if row[0] in v:
                XID_list[k].append(row[1])  # write found GOs into dictionary
    with open("/output.txt", 'wb') as outfile:  # save mapped SeqIDs plus XIDs
        looked_up_go = csv.writer(outfile, delimiter='\t')
        for key, val in XID_list.iteritems():
            looked_up_go.writerow([key] + val)
The desired output should be a file listing the original SeqIDs and the corresponding XIDs:
SeqID | XID
SeqA X781 X676
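The code works, but it takes forever (maybe even longer). Reading the whole cross reference into a list is not very smart, I know that. I found some related questions, but none was quite what I was looking for.
I would appreciate any comments and suggestions.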
Answer 0 (score: 0)
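If you have one small file and one large file, the usual answer is to find a way to iterate over the large file once and iterate over the small file repeatedly (reading it into memory if possible, re-reading the file if not), rather than the other way around.
So, start with something like this: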
def map_GI(gilist, mapping_file):
    with open(gilist) as infile:
        read_gi = csv.reader(infile)
        GI_list = {rows[0]: rows[1:] for rows in read_gi}  # read GI list into dictionary
    with open(mapping_file) as mapping:  # that's the cross reference
        read_mapping = csv.reader(mapping, delimiter='\t')
        for gid, xids in read_mapping:  # single pass over the large file
            for gi_seqid, gi_gids in GI_list.items():
                # replace each matching GI with its XID, in place
                GI_list[gi_seqid] = [xids if gi_gid == gid else gi_gid for gi_gid in gi_gids]
    with open("/output.txt", 'wb') as outfile:  # save mapped SeqIDs plus XIDs
        looked_up_go = csv.writer(outfile, delimiter='\t')
        for key, val in GI_list.iteritems():
            looked_up_go.writerow([key] + val)
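However, if the small file is small enough, you can do even better: just build a reverse mapping, so you can look up which rows have to be modified instead of iterating over the whole list: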
with open(gilist) as infile:
    read_gi = csv.reader(infile)
    GI_list = {rows[0]: rows[1:] for rows in read_gi}  # read GI list into dictionary
revdict = defaultdict(list)  # reverse map: GI -> SeqIDs that contain it
for seqid, gids in GI_list.iteritems():
    for gid in gids:
        revdict[gid].append(seqid)
with open(mapping_file) as mapping:  # that's the cross reference
    read_mapping = csv.reader(mapping, delimiter='\t')
    for gid, xids in read_mapping:
        # .get() avoids creating an empty entry for every GI in the large file
        for seqid in revdict.get(gid, ()):
            GI_list[seqid] = [(xids if gi_gid == gid else gi_gid)
                              for gi_gid in GI_list[seqid]]
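For reference, here is a self-contained sketch of the same reverse-mapping idea, written for Python 3 (the code above uses Python 2 idioms such as `iteritems`). The names `map_rows` and `map_gi_files` are made up for illustration, and the sketch assumes both files are tab-delimited with the ID in the first column. Unlike the snippet above, it collects the XIDs into a separate dictionary instead of replacing the GIs in place, so unmatched GIs do not end up in the output.

```python
import csv
from collections import defaultdict


def map_rows(gi_rows, mapping_rows):
    """Return {seqid: [xid, ...]} given parsed rows from both files."""
    # Small file: SeqID -> list of GIs, kept entirely in memory.
    seq_to_gis = {row[0]: row[1:] for row in gi_rows if row}

    # Reverse map: GI -> SeqIDs that mention it.
    gi_to_seqs = defaultdict(list)
    for seqid, gis in seq_to_gis.items():
        for gi in gis:
            gi_to_seqs[gi].append(seqid)

    # Large file: a single pass, one dictionary lookup per row.
    seq_to_xids = {seqid: [] for seqid in seq_to_gis}
    for row in mapping_rows:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        gi, xid = row[0], row[1]
        # .get() so that GIs we do not care about are never stored
        for seqid in gi_to_seqs.get(gi, ()):
            seq_to_xids[seqid].append(xid)
    return seq_to_xids


def map_gi_files(gi_path, mapping_path, out_path):
    """File wrapper (assumes tab-delimited input; adjust as needed)."""
    with open(gi_path, newline='') as gi_file, \
            open(mapping_path, newline='') as mapping_file:
        result = map_rows(csv.reader(gi_file, delimiter='\t'),
                          csv.reader(mapping_file, delimiter='\t'))
    with open(out_path, 'w', newline='') as out_file:
        writer = csv.writer(out_file, delimiter='\t')
        for seqid, xids in result.items():
            writer.writerow([seqid] + xids)
```

The large file is streamed row by row and never loaded into a list, so memory use depends only on the size of the small file.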
(In fact, the same strategy works even if the small file is too big to fit in memory, by using a dbm in place of the dict for revdict.)
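A rough sketch of that disk-backed variant, under some assumptions: it is written for Python 3 and uses the standard-library `shelve` module (a thin layer over `dbm` that pickles values, so each GI can map to a list of SeqIDs) rather than raw `dbm`. The function names `build_reverse_shelf` and `iter_matches` are hypothetical, and rows are assumed to be already parsed, for example by `csv.reader`.

```python
import shelve


def build_reverse_shelf(gi_rows, shelf_path):
    """Store the reverse map GI -> [SeqID, ...] on disk instead of in a dict."""
    with shelve.open(shelf_path, flag='n') as shelf:  # 'n': always start empty
        for row in gi_rows:
            if not row:
                continue  # skip blank lines
            seqid, gis = row[0], row[1:]
            for gi in gis:
                seqids = shelf.get(gi, [])
                seqids.append(seqid)
                shelf[gi] = seqids  # reassign so the change is written back


def iter_matches(mapping_rows, shelf_path):
    """Stream the large file once, yielding (seqid, xid) for every hit."""
    with shelve.open(shelf_path, flag='r') as shelf:
        for row in mapping_rows:
            if len(row) < 2:
                continue  # skip blank or malformed lines
            gi, xid = row[0], row[1]
            for seqid in shelf.get(gi, ()):
                yield seqid, xid
```

Because `iter_matches` is a generator, nothing accumulates in memory; the pairs can be written out as they arrive and grouped by SeqID afterwards. Each lookup now hits the disk, so this is likely to be slower than the in-memory dict and only worth it when the reverse map really does not fit.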