Question

For example, I have a FASTA file containing the following sequences:

>human1
AGGGCGSTGC
>human2
GCTTGCGCTAG
>human3
TTCGCTAG

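How can I use Python to extract sequences by reading a text file with the following content? 1 means true and 0 means false. Only sequences whose value is 1 should be extracted.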

Example text file:

0
1
1

Expected output:

>human2
GCTTGCGCTAG
>human3
TTCGCTAG

Answer 1

For this, it is best to use Biopython:

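from Bio import SeqIO

# Read the mask file into a list of booleans, e.g. [False, True, True]
with open("mask.txt") as mask_file:
    mask = [line.strip() == "1" for line in mask_file]

# Parse every record, then keep those whose mask entry is True
seqs = list(SeqIO.parse("input.fasta", "fasta"))
seqs_filter = [seq for flag, seq in zip(mask, seqs) if flag]
for seq in seqs_filter:
    # format("fasta") already ends with a newline
    print(seq.format("fasta"), end="")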

You get:

>human2
GCTTGCGCTAG
>human3
TTCGCTAG

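Explanation

Parsing FASTA: in the FASTA format a sequence may span several lines (check the FASTA format), so it is best to use a dedicated library (a parser) to read the input and write the output.

Mask: I read the mask file and convert it to booleans, [False, True, True].

Filter: I use the zip function to pair each sequence with its mask value, then filter with a list comprehension.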
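To illustrate why multi-line records matter, here is a minimal sketch of the same mask-and-zip idea without Biopython. The helper names `read_fasta` and `filter_by_mask` are hypothetical (they are not part of any library), and the parser assumes well-formed input in which every record starts with a ">" header line. The in-memory data mirrors the question, except that human2 is deliberately wrapped over two lines.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) tuples; hypothetical helper, not a library API.

    Sequence text wrapped over several lines is joined into one string.
    """
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_mask(fasta_handle, mask_handle):
    """Return the records whose mask line is "1" (hypothetical helper)."""
    mask = [line.strip() == "1" for line in mask_handle]
    return [rec for flag, rec in zip(mask, read_fasta(fasta_handle)) if flag]


# In-memory stand-ins for input.fasta and mask.txt
fasta = io.StringIO(
    ">human1\nAGGGCGSTGC\n>human2\nGCTTGC\nGCTAG\n>human3\nTTCGCTAG\n"
)
mask = io.StringIO("0\n1\n1\n")

kept = filter_by_mask(fasta, mask)
for header, sequence in kept:
    print(header)
    print(sequence)
```

A line-by-line approach would apply the mask to the wrong lines for this input, which is the main reason to parse into records first.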

Answer 2

I think this may help you, and I think you should spend some time learning Python. Python is a good language for bioinformatics.

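# Read the mask: one 0/1 flag per sequence
display = []
with open('test.txt') as f:
    for line in f:
        display.append(int(line.strip()))

output_DNA = []
with open('XX.fasta') as f:
    index = -1
    for line in f:
        # A header line starts the next record
        if line.startswith('>'):
            index = index + 1

        # Keep the header and every sequence line of a flagged record
        if display[index]:
            output_DNA.append(line)

print(''.join(output_DNA), end='')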

Answer 3

You can build a list that acts as a mask while you read the FASTA file:

with open('mask.txt') as mf:
    mask = [ s.strip() == '1' for s in mf.readlines() ]

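Then:

with open('seq.fasta') as f:
    i = -1
    for line in f:
        if line.startswith('>'):
            i += 1  # each header line starts the next record
        if mask[i]:
            print(line, end='')  # or do something else with the line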

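Or:

# mask_file and seq_file are the paths to the mask and FASTA files.
# This pairing assumes every record is exactly two lines (header, sequence).
with open(mask_file) as mf, open(seq_file) as sf:
    for b, header in zip(mf, sf):
        sequence = next(sf, '')
        if b.strip() == '1':
            print(header + sequence, end='')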
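The standard library also provides `itertools.compress`, which does this kind of mask-based selection directly. The sketch below is one assumed way to apply it here: lines are grouped into records first, so the mask is applied per sequence rather than per line. The function name `select_records` is hypothetical, and the lists stand in for the lines of the two files.

```python
from itertools import compress


def select_records(fasta_lines, mask_lines):
    """Keep whole FASTA records whose mask entry is "1" (hypothetical helper)."""
    records = []
    for line in fasta_lines:
        if line.startswith(">"):
            records.append([line])       # a header opens a new record
        elif records:
            records[-1].append(line)     # sequence lines join the current record
    flags = [m.strip() == "1" for m in mask_lines]
    # compress() yields the records whose corresponding flag is truthy
    return ["".join(rec) for rec in compress(records, flags)]


fasta_lines = [">human1\n", "AGGGCGSTGC\n", ">human2\n", "GCTTGCGCTAG\n",
               ">human3\n", "TTCGCTAG\n"]
mask_lines = ["0\n", "1\n", "1\n"]

selected = select_records(fasta_lines, mask_lines)
print("".join(selected), end="")
```

With real files, the two arguments could simply be the open file objects, since both are iterated line by line.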

Answer 4

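I'm not familiar with the FASTA file format, but hopefully this helps. You can open the file in Python as follows and collect the indices of the valid lines in a list.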

valid = []
with open('test.txt') as f:
    all_lines = f.readlines() # get all the lines
    all_lines = [x.strip() for x in all_lines] # strip away newline chars
    for i in range(len(all_lines)):
        if all_lines[i] == '1': # if it matches our condition
            valid.append(i) # add the index to our list

    print(valid) # or get only the fasta file contents on these lines

I ran it with the following text file test.txt:

Printing valid gives this output:

[1, 2, 3, 6, 7]

I think this should help you move forward, but if you need me to expand the answer, let me know in the comments.
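One possible way to continue from the index list: treat each index as a 0-based record number (one mask line per sequence) and keep the matching header and sequence lines. This is only a sketch under the assumption that every record occupies exactly two lines; `records_at` is a hypothetical name, and the lists stand in for the stripped lines of the two files from the question.

```python
def records_at(fasta_lines, indices):
    """Return the header/sequence line pairs at the given 0-based record indices.

    Hypothetical helper; assumes each record is exactly two lines.
    """
    wanted = set(indices)
    out = []
    for i in range(0, len(fasta_lines) - 1, 2):
        if i // 2 in wanted:
            out.extend(fasta_lines[i:i + 2])
    return out


mask_lines = ["0", "1", "1"]
valid = [i for i, flag in enumerate(mask_lines) if flag == "1"]  # -> [1, 2]

fasta_lines = [">human1", "AGGGCGSTGC", ">human2", "GCTTGCGCTAG",
               ">human3", "TTCGCTAG"]
picked = records_at(fasta_lines, valid)
print("\n".join(picked))
```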

Python: How to extract DNA sequences based on a text file with binary content?
