Question

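If I have two files with DNA sequences, and everything is in order (with IDs, so the right sequence is easy to find), how can I merge the two matching lines into one consensus? An example is below (not using DNA sequences, so it is easier to read).

Note: all IDs are identical and in the same order, and the sequences have the same length. For example, if my file A has: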

>id1
THISISA-----
>id2
HELLO-------
>id3
TESTTESTTEST

and a second file B has:

>id1
-------TEST!
>id2
-----WORLD!!
>id3
TESTTESTTEST

my ideal output would simply be (in a new file C):

>id1
THISISATEST!
>id2
HELLOWORLD!!
>id3
TESTTESTTEST

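I'm terrible at working with strings in Python; so far I've managed to open each file with readlines and save the contents. Essentially, gaps are marked with "-", and if either of the two files has a character that can replace the hyphen, I want it to do so.

I'm just looking for hints on how to get started; I have no code to offer other than the following: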

import sys

file1 = sys.argv[1]
file2 = sys.argv[2]

msa1_seqs = []
msa1_ids = []

with open(file1, "r") as f1:
    content1 = f1.readlines()
for i in range(len(content1)):
    if i % 2 == 1:  # odd-numbered lines hold the DNA sequence
        msa1_seqs.append(content1[i])
    else:  # even-numbered lines hold the ">id" header
        msa1_ids.append(content1[i])

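I repeat the code above to open the second file (file2) and keep its text in the lists msa2_seqs and msa2_ids. Right now I can only think of writing the elements out in parallel, so that wherever the other file has a character, I could use another loop to change the "-" to that character.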

Answer 1

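You could first collect the lines by >id{int} in a collections.defaultdict, then write the grouped lines to a file. This approach also works if you have more than two files.

It also seems that you don't want to concatenate identical strings. If that is the case, and you also want to preserve order, you can use collections.OrderedDict from the Python standard library, using only its keys.

However, as of Python 3.7 (and, as an implementation detail, CPython 3.6), the standard dict is guaranteed to preserve insertion order. If that is the Python version you are using, you don't need OrderedDict; otherwise you can keep using it for portability.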
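To illustrate the ordering point, here is a minimal sketch showing that dict.fromkeys drops repeated strings while keeping first-seen order, just as OrderedDict.fromkeys does. The sample lists are made up to mirror the id3 and id1 records from the question; they are not read from any file.

```python
from collections import OrderedDict

# Hypothetical sample: the gap-stripped lines collected for id3 from two files.
parts = ["TESTTESTTEST", "TESTTESTTEST"]

# Both calls keep one copy of each distinct string, in first-seen order.
via_ordered = "".join(OrderedDict.fromkeys(parts))
via_dict = "".join(dict.fromkeys(parts))  # relies on Python 3.7+ dict ordering

# Distinct pieces (like id1) are all kept and joined in the order they arrived.
distinct = ["THISISA", "TEST!"]
joined = "".join(dict.fromkeys(distinct))

print(via_ordered, via_dict, joined)
```

On Python 3.7 or later the two spellings should behave the same here, so the choice is mostly about which older interpreters need to be supported.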

Demo:

from collections import defaultdict
from collections import OrderedDict

def collect_lines(dic, file, key, delim):
    curr_key = None

    for line in file:
        line = line.strip()

        # Check if new key has been found
        if line.startswith(key):
            curr_key = line
            continue

        # Otherwise add line with delim replaced
        dic[curr_key].append(line.replace(delim, ""))

d = defaultdict(list)

files = ["A.txt", "B.txt"]

# Collect lines from each file
for file in files:
    with open(file) as fin:
        collect_lines(dic=d, file=fin, key=">id", delim="-")

# Write new content to output
with open("output.txt", mode="w") as fout:
    for k, v in d.items():
        fout.write("%s\n%s\n" % (k, "".join(OrderedDict.fromkeys(v))))

output.txt:

>id1
THISISATEST!
>id2
HELLOWORLD!!
>id3
TESTTESTTEST

Answer 2

You can iterate over both input files line by line and write to the output file at the same time. Here is file_a.txt:

>id1
THISISA-----
>id2
HELLO-------
>id3
TESTTESTTEST

Here is file_b.txt:

>id1
-------TEST!
>id2
-----WORLD!!
>id3
TESTTESTTEST

The code is as follows:

#!/usr/bin/env python3
def merge(file_a, file_b, file_c, gap='-'):

    with open(file_a) as fa, open(file_b) as fb, open(file_c, 'w') as fc:

        for line_a, line_b in zip(fa, fb):

            if line_a.startswith('>id'):
                fc.write(line_a)
                continue

            s = ''.join(a if a != gap else b for a, b in zip(line_a, line_b))
            fc.write(s)


if __name__ == '__main__':
    merge('file_a.txt', 'file_b.txt', 'file_c.txt')

Here is the content of the resulting file_c.txt:

>id1
THISISATEST!
>id2
HELLOWORLD!!
>id3
TESTTESTTEST

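Note that with this approach you don't have to load the entire contents of the files into memory before processing. This matters if your DNA files are large.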
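The same streaming idea can probably be extended to more than two files. The sketch below is one possible way to do it, not part of the answer above: merge_aligned, merge_streams, and merge_files are hypothetical names, and it assumes every input has the same IDs in the same order with equal-length sequences, as stated in the question. For each column it takes the first non-gap character it finds.

```python
from contextlib import ExitStack


def merge_aligned(seqs, gap="-"):
    """Merge equal-length strings column by column, taking the first non-gap character."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences differ in length")
    return "".join(
        next((ch for ch in column if ch != gap), gap)  # stays a gap if every input has one
        for column in zip(*seqs)
    )


def merge_streams(streams, gap="-"):
    """Lazily yield merged lines from several iterables of FASTA-style lines."""
    # zip stops at the shortest input, so inputs are assumed to have equal line counts.
    for lines in zip(*streams):
        lines = [line.rstrip("\n") for line in lines]
        if lines[0].startswith(">"):
            # Header line: every input is assumed to carry the same ID here.
            if any(line != lines[0] for line in lines):
                raise ValueError("IDs out of sync: %r" % (lines,))
            yield lines[0]
        else:
            yield merge_aligned(lines, gap)


def merge_files(in_paths, out_path, gap="-"):
    """Stream any number of input files into one merged output file."""
    with ExitStack() as stack:
        inputs = [stack.enter_context(open(p)) for p in in_paths]
        out = stack.enter_context(open(out_path, "w"))
        for line in merge_streams(inputs, gap):
            out.write(line + "\n")
```

A call such as merge_files(['file_a.txt', 'file_b.txt'], 'file_c.txt') would reuse the file names from this answer. Only one line per input is held in memory at a time, and mismatched IDs or lengths raise an error instead of being merged silently.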

How to merge strings from two files that have the same index?
