I am searching for a given pattern in a string, and I want finditer to report not the first position of each match but a given position inside it. Both the pattern and the offset to report can vary. For example, if the pattern is 'ATTAAT', I want the position two characters after the start of the match. The simple way to do this is to add 2 to the start position:
import re
seq = 'GGGGATTAATCCCATTAATTAATCCC'
for m in re.finditer('ATTAAT', seq):
    print(m.start() + 2)
This prints positions 6 and 15, which is exactly what I want.
But my problem is that I need to search for several patterns at the same time (e.g. ATTAAT and GATC), and for each of these patterns I need to report a different relative position (for instance, start +2 in the first case and start +1 in the second).
I found that this could be solved with a positive look-behind assertion, by writing something like:
re.finditer('((?<=AT)TAAT)|((?<=G)ATC)', seq)
The problem is that if I apply this to my (simpler) first example, I get a different result:
import re
seq = 'GGGGATTAATCCCATTAATTAATCCC'
for m in re.finditer('(?<=AT)TAAT', seq):
    print(m.start())
This finds matches at positions 6, 15 and 19. I assume this is because the look-behind assertion does not consume the characters it checks: each match now consumes only TAAT, so the search resumes earlier and the occurrence of ATTAAT starting at 17, which overlaps the previous one, is no longer skipped.
To work around this, I could go through my list of results and remove the overlapping ones, but that seems very inefficient, and I was wondering if someone had a better idea, even one using a completely different strategy.
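For reference, the filtering I have in mind would look roughly like the sketch below. It assumes the look-behind part has a fixed, known length (the name prefix_len is mine), and it keeps a match only if the full occurrence, look-behind part included, starts at or after the end of the last occurrence that was kept:

```python
import re

seq = 'GGGGATTAATCCCATTAATTAATCCC'
prefix_len = 2  # length of the look-behind part 'AT' (assumed fixed and known)

kept = []
last_end = 0  # end of the last full occurrence that was kept
for m in re.finditer('(?<=AT)TAAT', seq):
    # Where the full occurrence (look-behind part included) begins.
    full_start = m.start() - prefix_len
    if full_start >= last_end:
        kept.append(m.start())
        last_end = m.end()

print(kept)
```

With several patterns, prefix_len would have to be looked up per alternative, which is part of why this feels clumsy to me.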
EDIT
I found a solution using capturing groups instead of a look-behind:
import re
seq = 'GGGGATTAATCCCATTAATTAATCCC'
for m in re.finditer('(AT)(TAAT)', seq):
    print(m.start(2))
I would have preferred to use a look-behind assertion, but I still don't know how to make finditer consume the whole pattern, including the look-behind part.
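Extending the capturing-group idea to several patterns might look like the sketch below. The helper find_offset_positions and the pattern-to-offset dictionary are names I made up for illustration. Each pattern is split at its offset into a non-capturing prefix and a capturing group, so every alternative contributes exactly one group and m.lastindex tells me which one matched:

```python
import re

def find_offset_positions(seq, offsets):
    """Return start + offset for each non-overlapping match of any pattern."""
    alternatives = []
    for pattern, offset in offsets.items():
        # Non-capturing prefix, then a group starting at the wanted offset.
        prefix, rest = pattern[:offset], pattern[offset:]
        alternatives.append('(?:%s)(%s)' % (re.escape(prefix), re.escape(rest)))
    combined = re.compile('|'.join(alternatives))
    # Only one group participates in each match, so lastindex identifies it.
    return [m.start(m.lastindex) for m in combined.finditer(seq)]

seq = 'GGGGATTAATCCCATTAATTAATCCC'
print(find_offset_positions(seq, {'ATTAAT': 2, 'GATC': 1}))
```

Because the prefix is part of the match here, finditer consumes the full pattern and overlapping occurrences are skipped as in my first example. If two patterns could match at the same position, the order of the alternatives would decide which one wins.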