I have two files: smaller_file (240,000 lines) and greater_file (2.11 million lines).
smaller_file format:
1 hsa-mir-183 Hepatocellular carcinoma hsa-mir-374a Hepatocellular carcinoma 0.97866 0 1
2 hsa-mir-374a Hepatocellular carcinoma hsa-mir-182 Hepatocellular carcinoma 0.97816 0 1
... (and so on)
greater_file format:
1 hsa-mir-181c Thyroid carcinoma, papillary hsa-mir-221 Thyroid carcinoma, papillary 16365291 16365291 -1.00000 1
2 hsa-mir-220a Thyroid carcinoma, papillary hsa-mir-221 Thyroid carcinoma, papillary 16365291 16365291 -1.00000 1
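... (and so on)
smaller_file has 8 columns and greater_file has 9 columns. I am comparing columns 2 through 5 of the two files, and where they match I need to replace the value in column 8 of greater_file with the value in column 6 of smaller_file. This is what I have so far: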
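
clines = set("\t".join(cline.split('\t')[1:5]) for cline in open(smaller_file))
print("set created!")
with open(larger_file) as a:
    with open("scores.txt", "w") as result:
        for line in a:
            line = line.split('\t')
            look_for = "\t".join(line[1:5])
            if look_for in clines:
                # below line is incomplete as I don't have the 6th column value from smaller_file
                result.write("\t".join(line[1:7]) + "\t" + line[len(line)-1].split("\n")[0] + "\n")

I am using a set operation to avoid the O(n^2) cost of two nested for loops. However, I cannot capture the 6th column of smaller_file with the set. Storing it there would make the comparison on the 9th line (the `if look_for in clines:` check) difficult, because sets do not support indexing. I could add another for loop after the 9th line to look up the 6th column, but that would increase the complexity and defeat the purpose of using a set.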
Any help with this would be appreciated.
Answer 0 (score: 1)
I suggest using a dictionary instead of a set to hold the values read from smaller_file.
https://docs.python.org/2/library/stdtypes.html#dict
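
As a minimal sketch of the difference, assuming tab-separated fields and two rows shaped like the smaller_file sample (the names keys_only, key_to_score and probe are illustrative, not from the original code): a set can only say whether a key is present, while a dictionary also returns the value stored with it.

```python
# Two rows shaped like the smaller_file sample; tab separation is an assumption.
rows = [
    "1\thsa-mir-183\tHepatocellular carcinoma\thsa-mir-374a\tHepatocellular carcinoma\t0.97866\t0\t1",
    "2\thsa-mir-374a\tHepatocellular carcinoma\thsa-mir-182\tHepatocellular carcinoma\t0.97816\t0\t1",
]

keys_only = set()      # remembers the keys, nothing else
key_to_score = {}      # remembers the keys and the column 6 value for each

for row in rows:
    fields = row.split("\t")
    key = "\t".join(fields[1:5])   # columns 2-5 form the lookup key
    keys_only.add(key)
    key_to_score[key] = fields[5]  # column 6 is the value to carry over

probe = "\t".join([
    "hsa-mir-183", "Hepatocellular carcinoma",
    "hsa-mir-374a", "Hepatocellular carcinoma",
])

found_in_set = probe in keys_only                   # membership only
score = key_to_score.get(probe)                     # the stored column 6 value
missing = key_to_score.get("no\tsuch\tkey\there")   # None when the key is absent
```

Both lookups are hash-based, so switching from the set to the dictionary should not bring back the nested-loop cost.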
Your code would look like this:

# smaller_file and larger_file are assumed to hold the paths of the two input files
clines_dict = {}  # start with an empty dictionary
with open(smaller_file) as b:
    for line in b:
        fields = line.split('\t')
        # clines_dict[key] = value: columns 2-5 form the key, column 6 is the value
        clines_dict["\t".join(fields[1:5])] = fields[5]
print("dictionary created!")
with open(larger_file) as a:
    with open("scores.txt", "w") as result:
        for line in a:
            line = line.split('\t')
            look_for = "\t".join(line[1:5])
            if look_for in clines_dict:  # check if look_for is a key in your dictionary
                result.write("\t".join(line[1:7]) + "\t" + clines_dict[look_for] + "\n")

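In the last step of the loop, you check whether look_for is a key in the dictionary and, if it is, retrieve the value belonging to that key (column 6). Another good way to do this kind of thing is to use SQL.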
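
For the SQL route, here is a rough sketch using Python's built-in sqlite3 module with an in-memory database. The table names small and big, the column names c1 to c9, and the function replace_scores are all made up for illustration. It assumes tab-separated lines with exactly 8 and 9 fields, keeps the last score when a key repeats (as a dictionary would), and drops rows without a match (as the loop above does). It writes all 9 columns with column 8 replaced, which is what the question describes.

```python
import sqlite3


def replace_scores(smaller_rows, greater_rows):
    """Join the two inputs on columns 2-5 and return greater_file rows
    with column 8 replaced by smaller_file's column 6."""
    conn = sqlite3.connect(":memory:")
    # The composite primary key also serves as the index used by the join.
    conn.execute(
        "CREATE TABLE small (c2 TEXT, c3 TEXT, c4 TEXT, c5 TEXT, score TEXT, "
        "PRIMARY KEY (c2, c3, c4, c5))"
    )
    conn.execute(
        "CREATE TABLE big (c1 TEXT, c2 TEXT, c3 TEXT, c4 TEXT, c5 TEXT, "
        "c6 TEXT, c7 TEXT, c8 TEXT, c9 TEXT)"
    )
    # Columns 2-6 of smaller_file; OR REPLACE keeps the last score per key.
    conn.executemany(
        "INSERT OR REPLACE INTO small VALUES (?, ?, ?, ?, ?)",
        (tuple(r.rstrip("\r\n").split("\t")[1:6]) for r in smaller_rows),
    )
    # All 9 columns of greater_file, in input order.
    conn.executemany(
        "INSERT INTO big VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (tuple(r.rstrip("\r\n").split("\t")) for r in greater_rows),
    )
    cursor = conn.execute(
        "SELECT big.c1, big.c2, big.c3, big.c4, big.c5, big.c6, big.c7, "
        "small.score, big.c9 "
        "FROM big JOIN small "
        "ON big.c2 = small.c2 AND big.c3 = small.c3 "
        "AND big.c4 = small.c4 AND big.c5 = small.c5 "
        "ORDER BY big.rowid"
    )
    rows = ["\t".join(row) for row in cursor]
    conn.close()
    return rows
```

With real files, the two arguments could be open file objects, since the function only iterates over lines. For inputs of this size, a file-backed database instead of ":memory:" may be worth trying if memory is tight.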