ORF_sequences = re.findall(r'ATG(?:...){9,}?(?:TAA|TAG|TGA)',sequence) #thanks to @Martin Pieters and @nneonneo
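I have a line of code that finds any instance of A|G followed by 2 characters, then ATG, and then reads in units of 3 up to TAA|TAG|TGA. It only matches when A|G-xx-ATG-xxx-TAA|TAG|TGA is 30 elements or longer.
I want to add one more criterion:
I need the ATG to be followed by a G,
so A|G-xx-ATG-Gxx-xxx-TAA|TGA|TAG #at least 30 elements
ex. GCCATGGGGTTTTTTTTTTTTTTTTTTTTTTTTTGA
^ would work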
GCATGAGGTTTTTTTTTTTTTTTTTTTTTTTTTGA
^ would not work because it is an (A|G) followed by only one value (not 2) before the ATG and there is not a G following the A|G-xx-ATG
I hope this makes sense.
I tried:
ORF_sequences = re.findall(r'ATGG(?:...){9,}?(?:TAA|TAG|TGA)',sequence)
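but it seems to apply the window size of 3 starting after the last G of ATGG.
Basically I need that code, where the first occurrence is A|G-xx-ATG and the second occurrence is (G-xx).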
Answer 0 (score: 1)
This is easier if you use a character class, [AG]; there is also no need to group the two "free" characters:
ORF_sequences2 = re.findall(r'[AG]..ATG(?:...)*?(?:TAA|TAG|TGA)',fdna)
Alternatively, you need to put a group around A|G:
ORF_sequences2 = re.findall(r'(?:A|G)..ATG(?:...)*?(?:TAA|TAG|TGA)',fdna)
Applying the first form to your examples:
>>> re.findall(r'[AG]..ATG(?:...)*?(?:TAA|TAG|TGA)', 'GCCATGGGGTTTTGA')
['GCCATGGGGTTTTGA']
>>> re.findall(r'[AG]..ATG(?:...)*?(?:TAA|TAG|TGA)', 'GCATGGGGTTTTGA')
[]
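As a quick sanity check, the sketch below runs both spellings over the two short examples above and confirms that they return the same matches. The variable names are illustrative and do not come from the original post.

```python
import re

# The two equivalent spellings: a character class and a grouped alternation.
char_class = re.compile(r'[AG]..ATG(?:...)*?(?:TAA|TAG|TGA)')
grouped_alt = re.compile(r'(?:A|G)..ATG(?:...)*?(?:TAA|TAG|TGA)')

# The two short example sequences used in the demo above.
samples = ['GCCATGGGGTTTTGA', 'GCATGGGGTTTTGA']

# Map each sample to the pair of result lists (character class, grouped alternation).
results = {s: (char_class.findall(s), grouped_alt.findall(s)) for s in samples}
for s, (from_class, from_group) in results.items():
    print(s, from_class, from_group)
```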
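In your attempt, the expression matches either A or the expression G(?:..)ATG(?:...)*?(?:TAA|TAG|TGA), because the | symbol applies to everything before or after it within the same group. Since it is not inside a group, it applies to the whole expression: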
>>> re.findall(r'A|G(?:..)ATG(?:...)*?(?:TAA|TAG|TGA)', 'A')
['A']
>>> re.findall(r'A|G(?:..)ATG(?:...)*?(?:TAA|TAG|TGA)', 'GCCATGGGGTTTTGA')
['GCCATGGGGTTTTGA']
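If you need the whole match to contain a certain minimum number of characters, you need to adjust the 3-character (?:...) group to match a minimum number of times: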
ORF_sequences2 = re.findall(r'[AG]..ATGG..(?:...){7,}?(?:TAA|TAG|TGA)',fdna)
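This matches A or G followed by 2 characters, then ATGG followed by 2 more characters, then 3 characters repeated at least 7 times (21 in total), and finally one of the specific 3-character patterns (TAA, TAG or TGA), for a total of at least 33 characters from the first character to the last. The extra .. completes the group of 3 after ATG, and the pattern matches the example from your comment: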
>>> re.findall(r'[AG]..ATGG..(?:...){7,}?(?:TAA|TAG|TGA)', 'GCCATGGGGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGA')
['GCCATGGGGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGA']
It also handles the examples given in the question correctly:
>>> re.findall(r'[AG]..ATGG..(?:...){7,}?(?:TAA|TAG|TGA)', 'GCCATGGGGTTTTTTTTTTTTTTTTTTTTTTTTTGA')
['GCCATGGGGTTTTTTTTTTTTTTTTTTTTTTTTTGA']
>>> re.findall(r'[AG]..ATGG..(?:...){7,}?(?:TAA|TAG|TGA)', 'GCATGAGGTTTTTTTTTTTTTTTTTTTTTTTTTGA')
[]
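The reading-frame shift described in the question can be seen by comparing the attempted ATGG pattern with the ATGG.. form. The sketch below also shows one alternative that is not part of the original answers: a lookahead, ATG(?=G), which checks for the G without consuming it, so the pattern asks for eight full triplets after ATG instead of G.. plus seven. The test sequence is built for illustration only.

```python
import re

STOP = r'(?:TAA|TAG|TGA)'

# The attempt from the question: consuming the G shifts the triplet window by one.
shifted = re.compile(r'[AG]..ATGG(?:...){7,}?' + STOP)
# The form from the answer: 'G..' completes the first triplet after ATG.
in_frame = re.compile(r'[AG]..ATGG..(?:...){7,}?' + STOP)
# Alternative (an assumption, not from the original answers): the lookahead
# tests for G but leaves it unconsumed, so triplets stay aligned with ATG.
lookahead = re.compile(r'[AG]..ATG(?=G)(?:...){8,}?' + STOP)

# Illustrative sequence: prefix, ATG, a G-initial triplet, 7 filler triplets, stop.
seq = 'GCC' + 'ATG' + 'GGG' + 'TTT' * 7 + 'TGA'

print(in_frame.findall(seq))   # whole sequence matches
print(lookahead.findall(seq))  # same match via the lookahead form
print(shifted.findall(seq))    # no match: TGA is out of frame after ATGG
```

The lookahead form keeps the triplet count easy to read, since every (?:...) group after ATG is a complete codon.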
Answer 1 (score: 1)
To make sure you get at least 30 characters, use the {n,} quantifier:
r'[AG]..ATG(?:...){9,}?(?:TAA|TAG|TGA)'
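This ensures that you read at least 9 triplets (27 characters) between the opening ATG and the TAA|TGA|TAG terminator, so the whole match, including the [AG].. prefix, the ATG, and the terminator, is at least 36 characters long.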
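The length arithmetic can be checked directly. The sketch below builds one illustrative sequence with exactly nine triplets between ATG and the terminator, and a second with only eight. The helper name make_seq and the CTC filler are hypothetical choices made for this example.

```python
import re

pattern = re.compile(r'[AG]..ATG(?:...){9,}?(?:TAA|TAG|TGA)')

def make_seq(n_triplets):
    """Build prefix + ATG + n_triplets filler triplets + terminator (illustrative data)."""
    return 'GCC' + 'ATG' + 'CTC' * n_triplets + 'TGA'

long_enough = make_seq(9)  # 3 + 3 + 27 + 3 = 36 characters
too_short = make_seq(8)    # 3 + 3 + 24 + 3 = 33 characters

print(len(long_enough), pattern.findall(long_enough))  # matches the whole sequence
print(len(too_short), pattern.findall(too_short))      # no match: only 8 triplets
```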