non-overlapping positive look-behind assertion

Time: 2017-03-22 18:49:19

Tags: python regex

I am searching for a given pattern in a string, and I want finditer to return not the first position of the match but a chosen position inside it. Both the pattern and the position to return can vary. For example, for the pattern 'ATTAAT' I want the position two characters into the match. The simple way to do that is to add 2 to the start position:

import re
seq = 'GGGGATTAATCCCATTAATTAATCCC'
for m in re.finditer('ATTAAT', seq):
    # report the position two characters into each match
    print(m.start() + 2)

This prints positions 6 and 15, which is exactly what I want.

But my problem is that I need to search for several patterns at the same time (e.g. ATTAAT and GATC), and for each of these patterns I need to return a different relative position (for instance, start +2 for the first pattern and start +1 for the second).

I found that this could be solved with positive look-behind assertions, by writing something like:

re.finditer('((?<=AT)TAAT)|((?<=G)ATC)', seq)

The problem is that when I apply it to my (simpler) first example, I get a different result:

import re
seq = 'GGGGATTAATCCCATTAATTAATCCC'
for m in re.finditer('(?<=AT)TAAT', seq):
    # the look-behind only checks for 'AT'; the match itself starts at 'TAAT'
    print(m.start())

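This finds matches at positions 6, 15 and 19. I assume this is because the look-behind is zero-width: each match consumes only TAAT, so the AT at the end of the match at 15 (positions 17-18) can then serve as the look-behind for the TAAT at position 19, and finditer reports an occurrence that overlaps the previous ATTAAT.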
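To check the assumption above, the spans of the two searches can be compared. This is only a small sketch; the variable names full_spans and lookbehind_spans are made up for illustration.

```python
import re

seq = 'GGGGATTAATCCCATTAATTAATCCC'

# Full pattern: each match consumes all six characters of 'ATTAAT'.
full_spans = [m.span() for m in re.finditer('ATTAAT', seq)]

# Look-behind: only 'TAAT' is consumed; the preceding 'AT' is checked
# but stays outside the match, so it may belong to an earlier match.
lookbehind_spans = [m.span() for m in re.finditer('(?<=AT)TAAT', seq)]

print(full_spans)        # [(4, 10), (13, 19)]
print(lookbehind_spans)  # [(6, 10), (15, 19), (19, 23)]
```

The second match of the look-behind version ends at 19, and the third one starts right there, using characters 17-18 of the previous match as its look-behind context.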

To solve this, I could go through my list of results and remove the overlapping ones, but that seems very inefficient, and I was wondering if someone had a better idea, even one using a completely different strategy.


EDIT

I found a solution using capturing groups:

import re
seq = 'GGGGATTAATCCCATTAATTAATCCC'
for m in re.finditer('(AT)(TAAT)', seq):
    # group 1 consumes the 'AT' prefix; group 2 starts at the wanted position
    print(m.start(2))

I would have preferred to use a look-behind assertion, but I still don't know how to make finditer consume the whole pattern, look-behind part included.
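A minimal sketch of how the capturing-group approach might extend to several patterns at once, assuming each alternative contains exactly one capturing group placed around the part whose start should be reported. The helper reported_positions and the string demo_seq are hypothetical, added only for illustration; demo_seq is the original sequence with a GATC inserted so that the second alternative has something to match.

```python
import re

def reported_positions(pattern, seq):
    """Return, for every match, the start of the capturing group that matched.

    Assumes each alternative in `pattern` has exactly one capturing group,
    so m.lastindex identifies the group belonging to the alternative used.
    """
    return [m.start(m.lastindex) for m in re.finditer(pattern, seq)]

# The prefix outside the group is consumed normally, so matches cannot overlap.
combined = 'AT(TAAT)|G(ATC)'

seq = 'GGGGATTAATCCCATTAATTAATCCC'
demo_seq = 'GGGGATTAATCCGATCATTAATTAATCCC'  # hypothetical sequence containing GATC

print(reported_positions(combined, seq))       # [6, 15]
print(reported_positions(combined, demo_seq))  # [6, 13, 18]
```

Because the whole alternative is consumed, overlapping occurrences are skipped between different patterns as well, which may or may not be the desired behaviour.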

0 answers:

No answers