How do I add an extra criterion to re.findall in Python 2.7?

Time: 2013-03-12 19:31:00

Tags: python regex findall

ORF_sequences = re.findall(r'ATG(?:...){9,}?(?:TAA|TAG|TGA)',sequence)  #thanks to @Martin Pieters and @nneonneo

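I have a line of code that finds any instance of A|G followed by 2 characters, then ATG, and then reads in units of 3 until it reaches TAA|TAG|TGA. It only matches when A|G-xx-ATG-xxx-TAA|TAG|TGA is 30 elements or longer.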

I want to add one more criterion:

I need the ATG to be followed by a G,

so A|G-xx-ATG-Gxx-xxx-TAA|TGA|TAG # at least 30 elements long

Example:

GCCATGGGGTTTTTTTTTTTTTTTTTTTTTTTTTGA
^ would work

GCATGAGGTTTTTTTTTTTTTTTTTTTTTTTTTGA
^ would not work because it is an (A|G) followed by only one value (not 2) before the ATG and there is not a G following the A|G-xx-ATG

I hope this makes sense.

I tried

ORF_sequences = re.findall(r'ATGG(?:...){9,}?(?:TAA|TAG|TGA)',sequence)

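but that seems to start the window of size 3 after the last G of ATGG.

Basically I need that code, where the first element is A|G-xx-ATG and the second element is (G-xx).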

2 answers:

Answer 0: (score: 1)

It is easier if you use a character class, [AG], and there is no need to group the two "free" characters:

 ORF_sequences2 = re.findall(r'[AG]..ATG(?:...)*?(?:TAA|TAG|TGA)',fdna)

Otherwise you need to group the A|G:

 ORF_sequences2 = re.findall(r'(?:A|G)..ATG(?:...)*?(?:TAA|TAG|TGA)',fdna)

Applying the first form to your examples:

>>> re.findall(r'[AG]..ATG(?:...)*?(?:TAA|TAG|TGA)', 'GCCATGGGGTTTTGA')
['GCCATGGGGTTTTGA']
>>> re.findall(r'[AG]..ATG(?:...)*?(?:TAA|TAG|TGA)', 'GCATGGGGTTTTGA')
[]

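In your attempt, the expression matches either A or the expression G(?:..)ATG(?:...)*?(?:TAA|TAG|TGA), because the | symbol applies to everything before or after it within the same group. Since it is not grouped, it applies to the whole expression: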

>>> re.findall(r'A|G(?:..)ATG(?:...)*?(?:TAA|TAG|TGA)', 'A')
['A']
>>> re.findall(r'A|G(?:..)ATG(?:...)*?(?:TAA|TAG|TGA)', 'GCCATGGGGTTTTGA')
['GCCATGGGGTTTTGA']
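
The precedence point can be checked side by side. The following is only a sketch: the variable names are made up for illustration, and the sample string is an assumption chosen so that it starts with A rather than G, which makes the difference visible.

```python
import re

# Shared tail: two free characters, ATG, lazily matched triplets, stop codon.
tail = r'..ATG(?:...)*?(?:TAA|TAG|TGA)'

ungrouped = re.compile(r'A|G' + tail)      # reads as: 'A'  OR  'G' + tail
grouped = re.compile(r'(?:A|G)' + tail)    # reads as: ('A' or 'G') + tail
char_class = re.compile(r'[AG]' + tail)    # same meaning as the grouped form

sample = 'ACCATGGGGTTTTGA'  # illustrative input, not from the question

ungrouped_hits = ungrouped.findall(sample)   # only the lone 'A' characters
grouped_hits = grouped.findall(sample)       # the whole sequence
class_hits = char_class.findall(sample)      # the whole sequence
```

With the ungrouped form, every A in the input satisfies the left-hand alternative on its own, so the longer right-hand alternative is only tried at positions that start with G.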

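If you need the whole match to contain a minimum number of characters, adjust the 3-character (?:...) group so that it matches a minimum number of times: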

 ORF_sequences2 = re.findall(r'[AG]..ATGG..(?:...){7,}?(?:TAA|TAG|TGA)',fdna)

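This matches A or G followed by 2 characters, then ATGG followed by another 2 characters, then 3 characters at least 7 times (21 in total), then one more specific group of 3 (TAA, TAG or TGA), for a total of at least 33 characters from the first character to the last. The extra .. completes the group of 3 after ATG, and the pattern matches the example from your comments: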

>>> re.findall(r'[AG]..ATGG..(?:...){7,}?(?:TAA|TAG|TGA)', 'GCCATGGGGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGA')
['GCCATGGGGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGA']

It also handles the examples given in the question correctly:

>>> re.findall(r'[AG]..ATGG..(?:...){7,}?(?:TAA|TAG|TGA)', 'GCCATGGGGTTTTTTTTTTTTTTTTTTTTTTTTTGA')
['GCCATGGGGTTTTTTTTTTTTTTTTTTTTTTTTTGA']
>>> re.findall(r'[AG]..ATGG..(?:...){7,}?(?:TAA|TAG|TGA)', 'GCATGAGGTTTTTTTTTTTTTTTTTTTTTTTTTGA')
[]
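
The length arithmetic can be made explicit with a small helper. This is a sketch under assumptions: `build_orf_pattern` and `min_inner_codons` are hypothetical names, the test sequences are constructed here rather than taken from the question, and the input is assumed to be uppercase DNA on a single line (`.` does not match a newline by default).

```python
import re

def build_orf_pattern(min_inner_codons=7):
    """Compile: [AG]xx, ATG, Gxx, at least min_inner_codons triplets, stop codon."""
    pattern = (
        r'[AG]..'             # A or G plus two free characters
        r'ATG'                # start codon
        r'G..'                # the codon right after ATG must start with G
        r'(?:...){%d,}?'      # further triplets, matched lazily
        r'(?:TAA|TAG|TGA)'    # stop codon, read in frame
    ) % min_inner_codons
    return re.compile(pattern)

orf_re = build_orf_pattern()

good = 'GCC' + 'ATG' + 'GGG' + 'TTT' * 7 + 'TGA'    # 33 characters, should match
short = 'GCC' + 'ATG' + 'GGG' + 'TTT' * 6 + 'TGA'   # one triplet too few
no_g = 'GCC' + 'ATG' + 'AGG' + 'TTT' * 7 + 'TGA'    # codon after ATG starts with A

matches_good = orf_re.findall(good)
matches_short = orf_re.findall(short)
matches_no_g = orf_re.findall(no_g)
```

Changing `min_inner_codons` moves the minimum total length in steps of 3, since the fixed parts (prefix, ATG, G.. and the stop codon) always contribute 12 characters.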

Answer 1: (score: 1)

To make sure you get at least 30 characters, use the {n,} quantifier:

r'[AG]..ATG(?:...){9,}?(?:TAA|TAG|TGA)'

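This ensures that at least 9 triplets (27 characters) are read between the opening ATG and the TAA|TGA|TAG terminator. Counting the [AG].. prefix, the ATG and the stop codon as well, the shortest possible match is 36 characters.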
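
A quick way to see what the {9,} quantifier does in practice is to try it on constructed inputs. The sequences below are made up for illustration and are not from the question.

```python
import re

orf_re = re.compile(r'[AG]..ATG(?:...){9,}?(?:TAA|TAG|TGA)')

# Shortest input that can match: prefix + ATG + nine triplets + stop codon.
shortest = 'GCC' + 'ATG' + 'CCC' * 9 + 'TAA'
too_short = 'GCC' + 'ATG' + 'CCC' * 8 + 'TAA'

# An in-frame stop codon within the first nine triplets does not end the
# match, because nine triplets are consumed before a stop codon is looked for.
early_stop = 'GCC' + 'ATG' + 'TGA' + 'CCC' * 9 + 'TAG'

shortest_hits = orf_re.findall(shortest)
too_short_hits = orf_re.findall(too_short)
early_stop_hits = orf_re.findall(early_stop)
```

The last case may or may not be what you want for ORF detection: the early TGA is simply treated as one of the nine required triplets.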