I am trying to use Biopython to extract from a FASTA file every DNA sequence that contains the following short DNA sequence: "GGCTCAACCCTGGA"
This is what I have so far:
from Bio import SeqIO
source = "rep_set_no_spaces.fasta"
outfile = "rep_set_PNA_matches.fasta"
seq1 = "GGCTCAACCCTGGA"
# basically a function to check whether seq contains sub1
def seq_check(seq, seq1):
    return seq.find(seq1)
seqs = SeqIO.parse(source, 'fasta')
filtered = (seq for seq in seqs if seq_check(seq.seq, seq1))
SeqIO.write(filtered, outfile, 'fasta')
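I am trying to adapt the code from this post (Filtering a FASTA file based on sequence with BioPython), but the sequence I am interested in is at neither the start nor the end of my sequences.
For example, here are some of my sequences. The 1st and 4th match, but the 2nd and 3rd do not. I would like to pull out the matching sequences into a new FASTA file that contains only the sequences containing "GGCTCAACCCTGGA":
>110148arco.1D_184193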
TACGGAGGGGGTTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCACGTAGGTGGATTGGAAAGTATGGGGTGAAATCCCAGGGCTCAACCCTGGAACTGCCTCATAAACTATCAGTCTAGAGTTCGAGAGAGGTGAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTAGATATTCGGAGGAACACCAGTGGCGAAGGCGGCTCACTGGCTCGATACTGACACTGAGGTGCGAAAGTGTGGGGAGCAAACAGG
>110475arco.1D_40770
TACGGAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGTTTGTTAAGTCAGCTGTGAAAGCCCTGGGCTCAACCTGGGAATTGCAGTTGATACTGGCAAGCTGGAGTACGAGAGAGGGAGGTAGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAATACCAGTGGCGAAGGCGGCCTCCTGGCTCGATACTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACAGG
>110484arco.1D_190999
TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGTTTGTTAAGTCAGCTGTGAAAGCCCTGGGCTCAACCTGGGAATTGCAGTTGATACTGATCGACTAGAGTACGAGAGAGGGAGGTAGAATTCCACGTGTAGCGGTGAAATGCGTAGATATGTGGAGGAATACCGGTGGCGAAGGCGGCCTCCTGGCTCGATACTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACAGG
>110525amin.3D_40107
TACGGAGGGGGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGTACGTAGGCGGATTAGTAAGTAAGATGTGAAATCCCAGGGCTCAACCCTGGAACTGCATTTTAAACTGCTAGTCTAGAGTTATGGAGAGGTAAGTGGAATTCCTAGTGTAGAGGTGAAATTCGTAGATATTAGGAGGAACACCAGAGGCGAAGGCGACTTACTGGACATATACTGACGCTGAGGTACGAAAGTGTGGGTAGCAAACAGG
Thanks!
Answer 0 (score: 1)
Actually, this question is not about Biopython but about Python:
def seq_check(seq, seq1):
    # True if seq1 occurs anywhere in seq, False otherwise
    return seq1 in seq
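To illustrate why the original `find`-based check lets every record through, here is a small sketch using plain Python strings. `str.find` returns the index of the match, or -1 when there is none, and -1 is truthy; Biopython's `Seq.find` follows the same convention. The short fragments below are cut from the example sequences in the question, and the variable names are only for illustration.

```python
target = "GGCTCAACCCTGGA"

# Short fragments taken from the question's example sequences
with_match = "CCCAGGGCTCAACCCTGGAACTG"     # contains the target
without_match = "CCCTGGGCTCAACCTGGGAATTG"  # does not contain the target
at_start = target + "ACTG"                 # target at position 0

# find() returns an index, not a boolean
idx_match = with_match.find(target)        # 5
idx_missing = without_match.find(target)   # -1
idx_start = at_start.find(target)          # 0

# Used as a condition, -1 is truthy and 0 is falsy, which is the
# opposite of what a "contains" check needs
print(bool(idx_missing))  # True  (wrongly kept)
print(bool(idx_start))    # False (wrongly dropped)

# The `in` operator gives the intended answer
print(target in with_match)     # True
print(target in without_match)  # False
print(target in at_start)       # True
```

If the position of the match is needed as well, comparing the result against -1 (`seq.find(seq1) != -1`) gives the same filter as `in`.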
You can also put the `in` test directly into the generator expression:
filtered = (seq for seq in seqs if seq1 in seq)
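Putting the pieces together, a minimal end-to-end sketch might look like the following. It assumes Biopython is installed. The two shortened records and their IDs are made up for illustration, and in-memory `StringIO` handles stand in for the real input and output files from the question.

```python
from io import StringIO

from Bio import SeqIO

target = "GGCTCAACCCTGGA"

# Hypothetical two-record FASTA input (shortened sequences, invented IDs)
fasta_text = """>rec_with_target
CCCAGGGCTCAACCCTGGAACTG
>rec_without_target
CCCTGGGCTCAACCTGGGAATTG
"""

# Parse lazily and keep only records whose sequence contains the target
records = SeqIO.parse(StringIO(fasta_text), "fasta")
filtered = (rec for rec in records if target in rec.seq)

# SeqIO.write returns the number of records written
out = StringIO()
n_written = SeqIO.write(filtered, out, "fasta")
result = out.getvalue()

print(n_written)
print(result)
```

With real files, the same filter should work by passing the input and output file names to `SeqIO.parse` and `SeqIO.write` instead of the `StringIO` handles, as in the question's code.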