Question

I have two files: smaller_file (240,000 lines) and greater_file (2.11 million lines).

smaller_file format:

1       hsa-mir-183     Hepatocellular carcinoma        hsa-mir-374a    Hepatocellular carcinoma        0.97866 0       1
2       hsa-mir-374a    Hepatocellular carcinoma        hsa-mir-182     Hepatocellular carcinoma        0.97816 0       1
... (and so on)

greater_file format:

1       hsa-mir-181c    Thyroid carcinoma, papillary    hsa-mir-221     Thyroid carcinoma, papillary    16365291        16365291        -1.00000        1
2       hsa-mir-220a    Thyroid carcinoma, papillary    hsa-mir-221     Thyroid carcinoma, papillary    16365291        16365291        -1.00000        1
... (and so on)

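smaller_file has 8 columns and greater_file has 9 columns. I am comparing columns 2 to 5 in both files, and if they are the same, I need to replace the column 8 value in greater_file with the column 6 value from smaller_file. This is what I have so far: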

clines = set("\t".join(cline.split('\t')[1:5]) for cline in open(smaller_file))
print("set created!")
with open(larger_file) as a:
    with open("scores.txt", "w") as result:
        for line in a:
            line = line.split('\t')
            look_for = "\t".join(line[1:5])
            if look_for in clines:
                # below line is incomplete as I don't have the 6th column value from smaller_file
                result.write("\t".join(line[1:7]) + "\t" + line[len(line)-1].split("\n")[0] + "\n")

I am using a set operation to avoid the O(n²) cost of two nested for loops. However, I cannot capture the 6th column from smaller_file with the set. Doing so would make the membership test on look_for difficult, because sets do not support indexing. I could add another for loop after that test to find the 6th column, but that would increase the complexity and defeat the purpose of using a set.

Any help with solving this is appreciated.

Answer 1

I suggest using a dictionary instead of a set to read the values from 'smaller_file'.

https://docs.python.org/2/library/stdtypes.html#dict
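The difference can be seen in a minimal sketch on made-up rows: a set can only say whether a key is present, while a dictionary keyed the same way also returns the value stored under that key. The row values and the names small_rows, key_set, key_to_score and probe are illustrative assumptions, not taken from either file.

```python
# Toy rows standing in for smaller_file (8 columns); the values are made up.
small_rows = [
    ["1", "mir-a", "disease-x", "mir-b", "disease-x", "0.9", "0", "1"],
]

# A set of keys (columns 2 to 5) answers only "is this key present?"
key_set = {tuple(row[1:5]) for row in small_rows}

# A dict with the same keys also remembers column 6 (index 5).
key_to_score = {tuple(row[1:5]): row[5] for row in small_rows}

probe = ("mir-a", "disease-x", "mir-b", "disease-x")
print(probe in key_set)         # membership only
print(key_to_score.get(probe))  # the stored column 6 value, or None if absent
```

Both lookups are hash-based, so switching from the set to the dictionary keeps the single pass over each file.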

Your code would look like this:

clines_dict = {}  # start with an empty dictionary

with open(smaller_file) as b:
    for line in b:
        fields = line.split('\t')  # split each line only once
        # clines_dict[key] = [value]: columns 2-5 are the key, column 6 the value
        clines_dict["\t".join(fields[1:5])] = [fields[5]]

print("dictionary created!")

with open(larger_file) as a:
    with open("scores.txt", "w") as result:
        for line in a:
            line = line.split('\t')
            look_for = "\t".join(line[1:5])
            if look_for in clines_dict:  # check if look_for is a key in your dictionary
                result.write("\t".join(line[1:7]) + "\t" + clines_dict[look_for][0] + "\n")

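In the last step of the loop, you check whether "look_for" is a key in the dictionary and, if it is, retrieve the value that belongs to that key (column 6). Another good way to do this kind of thing is to use SQL.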
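The SQL route could be sketched with Python's built-in sqlite3 module. This is only a sketch under assumptions: the table names (small, big), the column names (k1 to k4, score, c6, c7, c9) and the function name replace_scores are invented for illustration, the rows are toy in-memory data, and the keys in the smaller table are assumed to be unique. Unlike the loop above, it keeps unmatched rows unchanged; using JOIN instead of LEFT JOIN would keep only the matched rows.

```python
import sqlite3

def replace_scores(small_rows, big_rows):
    """Return big_rows with column 8 replaced wherever columns 2-5 match small_rows."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE small (k1, k2, k3, k4, score)")
    con.execute("CREATE TABLE big (id, k1, k2, k3, k4, c6, c7, score, c9)")
    # From the 8-column rows keep only the key columns (2-5) and column 6.
    con.executemany(
        "INSERT INTO small VALUES (?, ?, ?, ?, ?)",
        [(r[1], r[2], r[3], r[4], r[5]) for r in small_rows],
    )
    con.executemany("INSERT INTO big VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", big_rows)
    # An index on the key columns lets the join avoid scanning small for every big row.
    con.execute("CREATE INDEX small_key ON small (k1, k2, k3, k4)")
    rows = con.execute(
        """
        SELECT big.id, big.k1, big.k2, big.k3, big.k4, big.c6, big.c7,
               COALESCE(small.score, big.score), big.c9
        FROM big
        LEFT JOIN small
          ON big.k1 = small.k1 AND big.k2 = small.k2
         AND big.k3 = small.k3 AND big.k4 = small.k4
        ORDER BY big.rowid
        """
    ).fetchall()
    con.close()
    return rows

# Made-up rows: the first big row matches the small row, the second does not.
toy_small = [["1", "mir-a", "dis-x", "mir-b", "dis-x", "0.9", "0", "1"]]
toy_big = [
    ["1", "mir-a", "dis-x", "mir-b", "dis-x", "111", "111", "-1.0", "1"],
    ["2", "mir-c", "dis-x", "mir-b", "dis-x", "111", "111", "-1.0", "1"],
]
for row in replace_scores(toy_small, toy_big):
    print("\t".join(row))
```

For the real files, the rows would come from splitting each line on tabs (and stripping the trailing newline) before passing them to executemany.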
