Question

I am trying to extract a doctor's name and title from a string. If "dr" is in the string, I want it to use that as the title and then use the next word as the doctor's name. However, I also want the regex to be compatible with strings that do not have "dr" in them. In that case, it should just match the first word as the doctor's name and assume no title.

I have come up with the following regex pattern:

pattern = re.compile(r'(DR\.? )?([A-Z]*)', re.IGNORECASE)

As I understand it, this should optionally match the letters "dr" (with or without a following period) and then a space, followed by a series of letters, case-insensitive. The problem is, it seems to only pick up the optional "dr" title if it is at the beginning of the string.

import re
# Raw string so the backslash in \. reaches the regex engine unchanged
pattern = re.compile(r'(DR\.? )?([A-Z]*)', re.IGNORECASE)
test1 = "Dr Joseph Fox"
test2 = "Joseph Fox"
test3 = "Optometry by Dr Joseph Fox"
print(pattern.search(test1).groups())
print(pattern.search(test2).groups())
print(pattern.search(test3).groups())

The code returns this:

('Dr ', 'Joseph')
(None, 'Joseph')
(None, 'Optometry')

The first two scenarios make sense to me, but why does the third not find the optional "Dr"? Is there a way to make this work?

Answer 1

You're seeing this behavior because a regex search scans from left to right and returns the first position where the pattern can match. At the start of your third string the optional first group matches nothing and [A-Z]* matches "Optometry", so the search stops there and never reaches "Dr". You can see this by using the findall regex function:

>>> print(pattern.findall(test3))
[('', 'Optometry'), ('', ''), ('', 'by'), ('', ''), ('Dr ', 'Joseph'), ('', ''), ('', 'Fox'), ('', '')]

The output shows that 'Dr Joseph' was successfully found; it just wasn't the first matching part of your string.

In my experience, trying to coerce regexes to express/capture multiple cases is often asking for inscrutable regexes. Specifically answering your question, I'd prefer to run the string through one regex requiring the 'Dr' title, and if I fail to get any matches, just split on spaces and take the first word (or however you want to go about getting the first word).
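A minimal sketch of that two-step approach, assuming a hypothetical helper named extract_doctor and a title group that captures "Dr" or "Dr." without the trailing space (both are illustrative choices, not part of the question's code):

```python
import re

# Hypothetical names; the pattern requires the title instead of making it optional.
TITLE_RE = re.compile(r'\b(DR\.?) ([A-Z]+)', re.IGNORECASE)

def extract_doctor(text):
    """Return (title, name); title is None when no "Dr" is found."""
    match = TITLE_RE.search(text)
    if match:
        return match.group(1), match.group(2)
    # Fallback: no title found, so take the first word as the name.
    words = text.split()
    return None, (words[0] if words else None)

for s in ("Dr Joseph Fox", "Joseph Fox", "Optometry by Dr Joseph Fox"):
    print(extract_doctor(s))
# ('Dr', 'Joseph')
# (None, 'Joseph')
# ('Dr', 'Joseph')
```

Keeping the fallback in plain Python means the regex only has to describe one case, which is the point made above.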

Answer 2

Regular expression engines match greedily from left to right. In other words, there is no "best" match; the first match is always the one returned. You can do a global search, though: check out re.findall().

Answer 3

Your regex accepts basically any word. If "dr" is not present, it will therefore be difficult to decide which word is the doctor's name, even after using findall.

Is the re.IGNORECASE really important? Are you only interested in the name of the doctor or both name and surname?

I would recommend using a regex that matches two words, each starting with an uppercase letter and separated by a single space, while keeping the optional "dr" in front.
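One way that suggestion might look, as a sketch only. The pattern and the helper name find_name are assumptions, and re.IGNORECASE is dropped on purpose, since capitalization is what marks a name here:

```python
import re

# Optional title, then two capitalized words separated by one space.
NAME_RE = re.compile(r'(?:\b(Dr\.?) )?\b([A-Z][a-z]+) ([A-Z][a-z]+)\b')

def find_name(text):
    """Return (title, first_name, surname), or None if nothing matches."""
    m = NAME_RE.search(text)
    return m.groups() if m else None

print(find_name("Dr Joseph Fox"))               # ('Dr', 'Joseph', 'Fox')
print(find_name("Joseph Fox"))                  # (None, 'Joseph', 'Fox')
print(find_name("Optometry by Dr Joseph Fox"))  # ('Dr', 'Joseph', 'Fox')
```

This still returns the leftmost match, so a string such as "Eye Care by Dr Joseph Fox" would yield "Eye Care" as the name; it only helps when the text before the name is not itself two capitalized words.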

If re.IGNORECASE is really important, it may be better to search for "dr" first and, if that is unsuccessful, store the first word as the name, as proposed in an earlier answer.

Answer 4

Look at the (?<=...) lookbehind syntax in the Python regex documentation.

Your re pattern will look about like this:

(DR\.? )?(?<=DR\.? )([A-Z]*)
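One caveat worth checking before relying on this: as far as I know, Python's re module only accepts fixed-width lookbehind, and DR\.? can be two or three characters wide, so the pattern above should fail to compile. The sketch below demonstrates that and shows a possible workaround with one fixed-width lookbehind per spelling of the title. The name after_title is made up for illustration, and the workaround only finds a name that follows a title, so the no-title case still needs separate handling.

```python
import re

# The variable-width lookbehind is rejected at compile time.
try:
    re.compile(r'(DR\.? )?(?<=DR\.? )([A-Z]*)', re.IGNORECASE)
    compiled = True
except re.error:
    compiled = False
print(compiled)  # False

# Workaround (an assumption, not from the answer): each lookbehind has a fixed width.
after_title = re.compile(r'(?:(?<=\bDR )|(?<=\bDR\. ))([A-Z]+)', re.IGNORECASE)

print(after_title.search("Optometry by Dr Joseph Fox").group(1))  # Joseph
print(after_title.search("Dr. Joseph Fox").group(1))              # Joseph
print(after_title.search("Joseph Fox"))                           # None
```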

Answer 5

Your pattern only picks up "Dr" when the match starts there; it does not search ahead for a "Dr" that appears later in the string.

Try pattern = re.compile(r'(.*DR\.? )?([A-Z]*)', re.IGNORECASE)
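To see what this pattern actually captures, here is a short sketch run against the three test strings from the question. Note that group 1 picks up everything before the title as well. The second pattern, named lazy here, is an assumed variant rather than part of this answer: it skips the leading text with a non-capturing lazy prefix so that only the title itself is captured.

```python
import re

tests = ["Dr Joseph Fox", "Joseph Fox", "Optometry by Dr Joseph Fox"]

# Pattern from this answer: .* sits inside group 1, so the prefix is captured too.
greedy = re.compile(r'(.*DR\.? )?([A-Z]*)', re.IGNORECASE)
# Assumed variant: lazy, non-capturing prefix; group 1 holds only the title.
lazy = re.compile(r'(?:.*?\b(DR\.?) )?([A-Z]+)', re.IGNORECASE)

greedy_results = [greedy.search(t).groups() for t in tests]
lazy_results = [lazy.search(t).groups() for t in tests]

print(greedy_results)
# [('Dr ', 'Joseph'), (None, 'Joseph'), ('Optometry by Dr ', 'Joseph')]
print(lazy_results)
# [('Dr', 'Joseph'), (None, 'Joseph'), ('Dr', 'Joseph')]
```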

Python Regex Skipping Optional Groups
