How can I find three-letter patterns in a sequence?

Time: 2015-02-12 15:31:29

Tags: python

My sequence is as follows:

my_file_m = """TCCATTCTCTACCCAGCCCCCACTCTGACCCCTTTACTCTGACCCCTTTATTGTCTACTCCTCAGAGCCCCCAGTCTGTA
TCCTTCTAACTTAGAAAGGGGATTATGGCTCAGGGTCCAACTCTGTGCTCAGAGCTTTCAACAACTACTCAGAAACACAA
GATGCTGGGACAGTGACCTGGACTGTGGGCCTCTCATGCACCACCATCAAGGACTCAAATGGGCTTTCCGAATTCACTGG
AGCCTCGAATGTCCATTCCTGAGTTCTGCAAAGGGAGAGTGGTCAGGTTGCCTCTGTCTCAGAATGAGGCTGGATAAGAT"""

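I want to find the positions and counts of three specific three-letter patterns: TAA, TGA, and TAG. If any are present, I would like to color them.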

I started by loading the letters:

my_file = open(my_file_m)
mine = my_file.read()
print(mine)

I can't use .count or .find, because I have three inputs. Any ideas on how to find and highlight them?

3 answers:

Answer 0: (score: 4)

Use the re.findall function and collections.Counter from the standard library:

import re
from collections import Counter

pat = re.compile(r"(TAA|TGA|TAG)")
c = re.findall(pat,my_file_m)

print(c)
print(Counter(c))

Output:

['TGA', 'TGA', 'TAA', 'TAG', 'TGA', 'TGA', 'TGA', 'TAA']
Counter({'TGA': 5, 'TAA': 2, 'TAG': 1})
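
The question also asks for positions, which re.findall does not report. A minimal sketch of one way to get them, assuming re.finditer is acceptable and that newlines should not count toward the index (find_codons and STOP_CODONS are names made up for this illustration):

```python
import re
from collections import Counter

# Hypothetical helper names; only re and collections come from the standard library.
STOP_CODONS = re.compile(r"TAA|TGA|TAG")

def find_codons(seq):
    """Return (start_index, codon) pairs for non-overlapping matches."""
    cleaned = "".join(seq.split())  # drop newlines so indices refer to letters only
    return [(m.start(), m.group()) for m in STOP_CODONS.finditer(cleaned)]

hits = find_codons("CTGAC\nTAAG")
print(hits)  # [(1, 'TGA'), (5, 'TAA')]
print(Counter(codon for _, codon in hits))
```

Indices are 0-based. If the newlines are left in the string, every position after a line break shifts by one per break, and a pattern that straddles a line break is missed.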

Answer 1: (score: 4)

Here is my solution to the problem:

Note: this code also finds overlapping matches. If you do not want overlaps, remove the '?=' from the pattern.

import re 

class bcolors:
    HEADER = '\033[95m'
    OKBLUE = '\033[94m'
    OKGREEN = '\033[92m'
    WARNING = '\033[93m'
    FAIL = '\033[91m'
    ENDC = '\033[0m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'

my_file_m= '''TTCCATTCTCTACCCAGCCCCCACTCTGACCCCTTTACTCTGACCCCTTTATTGTCTACTCCTCAGAGCCCCCAGTCTGTATCCTTCTAACTTAGAAAGGGGATTATGGCTCAGGGTCCAACTCTGTGCTCAGAGCTTTCAACAACTACTCAGAAACACAAGATGCTGGGACAGTGACCTGGACTGTGGGCCTCTCATGCACCACCATCAAGGACTCAAATGGGCTTTCCGAATTCACTGGAGCCTCGAATGTCCATTCCTGAGTTCTGCAAAGGGAGAGTGGTCAGGTTGCCTCTGTCTCAGAATGAGGCTGGATAAGAT'''


pat = re.compile(r'(?=(TAA|AAT|TGA|TAG))') # Very important, if you do not need overlaps then remove '?='
matches = re.finditer(pat,my_file_m)
result1 = [int(match.start(1)) for match in matches] # find all the starting positions of the string
result2 = [range(x,x+3) for x in result1 ] # find all the positions of the characters (given that we search for patterns of length 3, can be modified for other lengths too )
result3 = set().union(*result2) # generate a union

for chari in range(len(my_file_m)): # colorize based on if it is in a sequence or not
    if chari in result3:
        print(bcolors.OKGREEN + my_file_m[chari] + bcolors.ENDC, end=' ')
    else:
        print(my_file_m[chari], end=' ')
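
To see what the '?=' lookahead changes, here is a small sketch on a toy input. The string "TAAT" is made up for illustration; it uses the TAA and AAT alternatives from the pattern above because those two can share letters.

```python
import re

seq = "TAAT"  # hypothetical toy input: 'TAA' and 'AAT' share two letters

plain = re.compile(r"(TAA|AAT)")          # consumes the letters it matches
lookahead = re.compile(r"(?=(TAA|AAT))")  # zero-width match, consumes nothing

non_overlapping = [(m.start(1), m.group(1)) for m in plain.finditer(seq)]
overlapping = [(m.start(1), m.group(1)) for m in lookahead.finditer(seq)]

print(non_overlapping)  # [(0, 'TAA')]
print(overlapping)      # [(0, 'TAA'), (1, 'AAT')]
```

For the three patterns in the question alone (TAA, TGA, TAG), two matches can never overlap, because an overlapping match would need a T in the second or third position of one of them. With only those three, both forms of the pattern give the same positions.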

A cleaner version:

import re 
import sys

my_file_m= '''TAATTCCATTCTCTACCCAGCCCCCACTCTGACCCCTTTACTCTGACCCCTTTATTGTCTACTCCTCAGAGCCCCCAGTCTGTATCCTTCTAACTTAGAAAGGGGATTATGGCTCAGGGTCCAACTCTGTGCTCAGAGCTTTCAACAACTACTCAGAAACACAAGATGCTGGGACAGTGACCTGGACTGTGGGCCTCTCATGCACCACCATCAAGGACTCAAATGGGCTTTCCGAATTCACTGGAGCCTCGAATGTCCATTCCTGAGTTCTGCAAAGGGAGAGTGGTCAGGTTGCCTCTGTCTCAGAATGAGGCTGGATAAGAT'''

pat = re.compile(r'(?=(TAA|TGA|TAG))') # Very important, if you do not need overlaps then remove '?='
lettersToColor = set().union(*[range(m.start(1),m.start(1)+3) for m in re.finditer(pat, my_file_m)])

for chari in range(len(my_file_m)): # colorize based on if it is in a sequence or not
    if(chari in lettersToColor):
        sys.stdout.write('\033[92m' + my_file_m[chari]  +'\033[0m')
    else:
        sys.stdout.write(my_file_m[chari])

Thanks to: here, here

Output: [image]
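
If overlaps are not needed, the per-character loop can be replaced by a single substitution. This is a different technique from the code above (re.sub with a callback instead of a set of indices), sketched here with a made-up helper name, highlight, and assuming a terminal that understands ANSI color codes:

```python
import re

GREEN = "\033[92m"  # ANSI escape: bright green foreground
RESET = "\033[0m"   # ANSI escape: reset all attributes

def highlight(seq, codons=("TAA", "TGA", "TAG")):
    """Wrap each non-overlapping occurrence of the given patterns in color codes."""
    pattern = re.compile("|".join(map(re.escape, codons)))
    return pattern.sub(lambda m: GREEN + m.group(0) + RESET, seq)

print(highlight("CTGACTAAG"))
```

Because it returns a string rather than printing as it goes, the result can also be written to a file or compared against an expected value.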

Answer 2: (score: 0)

Do you need to split the DNA sequence every three letters to map the genetic code?

If so, see the code below.

my_file_m= '''TCCATTCTCTACCCAGCCCCCACTCTGACCCCTTTACTCTGACCCCTTTATTGTCTACTCCTCAGAGCCCCCAGTCTGTA
TCCTTCTAACTTAGAAAGGGGATTATGGCTCAGGGTCCAACTCTGTGCTCAGAGCTTTCAACAACTACTCAGAAACACAA
GATGCTGGGACAGTGACCTGGACTGTGGGCCTCTCATGCACCACCATCAAGGACTCAAATGGGCTTTCCGAATTCACTGG
AGCCTCGAATGTCCATTCCTGAGTTCTGCAAAGGGAGAGTGGTCAGGTTGCCTCTGTCTCAGAATGAGGCTGGATAAGAT'''

mm = "".join(my_file_m.split())                 # delete the new line characters

messenger = list(map(''.join, zip(*[iter(mm)]*3)))    # split every three letters; list() so .count also works on Python 3

print(messenger.count('TAA'))
print(messenger.count('TGA'))
print(messenger.count('TAG'))

Output:

0
1
0
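
The counts above depend on where the splitting starts. As a rough sketch, assuming you may want to check all three starting offsets (reading frames), the same idea can be written with slicing. The names codons_in_frame and stop_codon_counts are invented for this example, and the input is a toy string:

```python
from collections import Counter

def codons_in_frame(seq, frame=0):
    """Split seq into complete three-letter groups, starting at offset `frame`."""
    cleaned = "".join(seq.split())  # remove newlines and other whitespace
    return [cleaned[i:i + 3] for i in range(frame, len(cleaned) - 2, 3)]

def stop_codon_counts(seq, frame=0):
    """Count TAA, TGA and TAG among the groups of one reading frame."""
    counts = Counter(codons_in_frame(seq, frame))
    return {codon: counts[codon] for codon in ("TAA", "TGA", "TAG")}

for frame in range(3):
    print(frame, stop_codon_counts("ATGTAACCTGA", frame))
```

Leftover letters at the end that do not fill a group of three are dropped, as in the zip-based version.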