Python and regular expression substrings

Time: 2015-01-10 02:16:09

Tags: python regex python-2.7

I am trying to do this:

p = re.compile(ur'([A-Z]\w+\s+[A-Z]\w+)|([A-Z]\w+)(?=\s+and\s+[A-Z]\w+\s+([A-Z]\w+))', re.MULTILINE)
test_str = u"Russ Middleton and Lisa Murro\nRon Iervolino, Trish and Russ Middleton, and Lisa Middleton \nRon Iervolino, Kelly  and Tom Murro\nRon Iervolino, Trish and Russ Middleton and Lisa Middleton "
subst = u"$1$2 $3"
result = re.sub(p, subst, test_str)

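The goal is to get something that matches every name and fills in the last name where necessary (for example, Trish and Russ Middleton becomes Trish Middleton and Russ Middleton). Ultimately, I am looking for names that appear together.

Someone else was kind enough to help me with the regex, and I thought I knew how to write the rest in Python (although I am new to Python). When I could not get it to work, I fell back on the code generated by Regex101 (the code shown above). However, all I get in result is: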

u'$1$2 $3 and $1$2 $3\n$1$2 $3, $1$2 $3 and $1$2 $3, and $1$2 $3 \n$1$2 $3, $1$2 $3  and $1$2 $3\n$1$2 $3, $1$2 $3 and $1$2 $3 and $1$2 $3 '

What am I missing about Python and regular expressions?

3 answers:

Answer 0: (score: 1)

You are not using the correct syntax for subst. Instead, try:

subst = r'\1\2 \3'
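To illustrate the difference in syntax, here is a minimal sketch on a toy pattern (the pattern and input are made up for this example and are not part of the question). It assumes Python 3, although the same replacement strings behave the same way in Python 2.7.

```python
import re

pattern = re.compile(r'(\w+) (\w+)')

# "$2 $1" is inserted literally: Python's re does not treat "$" as a group reference.
dollar_style = pattern.sub('$2 $1', 'hello world')

# "\2 \1" (or the more explicit "\g<2> \g<1>") refers to the captured groups.
backslash_style = pattern.sub(r'\2 \1', 'hello world')
explicit_style = pattern.sub(r'\g<2> \g<1>', 'hello world')

print(dollar_style)     # $2 $1
print(backslash_style)  # world hello
print(explicit_style)   # world hello
```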

However, you now have the problem that no match has all three groups matched.

Specifically:

>>> for x in p.finditer(test_str): print(x.groups())
... 
('Russ Middleton', None, None)
('Lisa Murro', None, None)
('Ron Iervolino', None, None)
(None, 'Trish', 'Middleton')
('Russ Middleton', None, None)
('Lisa Middleton', None, None)
('Ron Iervolino', None, None)
(None, 'Kelly', 'Murro')
('Tom Murro', None, None)
('Ron Iervolino', None, None)
(None, 'Trish', 'Middleton')
('Russ Middleton', None, None)
('Lisa Middleton', None, None)

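Wherever you see None here, trying to interpolate the corresponding group (\1 and so on) in the replacement raises an error ("unmatched group") in Python 2.7. Since Python 3.5, an unmatched group is replaced with an empty string instead.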
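The version dependence can be checked with a small sketch like the one below. The two-group toy pattern is an assumption made for brevity; it only mimics the situation where one alternative leaves the other group unmatched.

```python
import re

# Only one of the two groups can participate in any given match.
pattern = re.compile(r'(a)|(b)')

try:
    replaced = pattern.sub(r'[\1\2]', 'ab')
except re.error as exc:
    # Python 2.7 and Python 3 before 3.5 end up here ("unmatched group").
    replaced = None
    print('re.error: {}'.format(exc))

# On Python 3.5 and later the unmatched group becomes an empty string.
print(replaced)
```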

A replacement function is more flexible:

>>> def mysub(mo):
...   return '{}{} {}'.format(
...     mo.group(1) or '',
...     mo.group(2) or '',
...     mo.group(3) or '')
... 
>>> result = re.sub(p, mysub, test_str)
>>> result
'Russ Middleton  and Lisa Murro \nRon Iervolino , Trish Middleton and Russ Middleton , and Lisa Middleton  \nRon Iervolino , Kelly Murro  and Tom Murro \nRon Iervolino , Trish Middleton and Russ Middleton  and Lisa Middleton  '

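Here, I have coded mysub to do what I suspect you thought a replacement string with group numbers would do for you: use an empty string wherever a group did not match (that is, wherever the corresponding mo.group(...) is None).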
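The output of mysub contains some stray spaces, because the format string always emits the space before group 3. One possible variant, sketched here for Python 3, joins only the groups that matched. The name fill_surname is hypothetical and not from the answer, and the sample string is a shortened piece of the question's input.

```python
import re

pattern = re.compile(
    r'([A-Z]\w+\s+[A-Z]\w+)|([A-Z]\w+)(?=\s+and\s+[A-Z]\w+\s+([A-Z]\w+))')

def fill_surname(mo):
    # Keep only the groups that matched, then join them with single spaces.
    return ' '.join(g for g in mo.groups() if g)

sample = 'Ron Iervolino, Trish and Russ Middleton'
filled = pattern.sub(fill_surname, sample)
print(filled)  # Ron Iervolino, Trish Middleton and Russ Middleton
```

Full names are replaced by themselves, so only a bare first name gains the surname captured in the lookahead.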

Answer 1: (score: 0)

I'd suggest a simple solution.

import re
string = """Russ Middleton and Lisa Murro
Ron Iervolino, Trish and Russ Middleton, and Lisa Middleton 
Ron Iervolino, Kelly  and Tom Murro
Ron Iervolino, Trish and Russ Middleton and Lisa Middleton """
m = re.sub(r'(?<=,\s)([A-Z]\w+)(?=\s+and\s+[A-Z]\w+\s+([A-Z]\w+))', r'\1 \2', string)
print(m)

Output:

Russ Middleton and Lisa Murro
Ron Iervolino, Trish Middleton and Russ Middleton, and Lisa Middleton 
Ron Iervolino, Kelly Murro  and Tom Murro
Ron Iervolino, Trish Middleton and Russ Middleton and Lisa Middleton

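The variant below uses the third-party regex module, which accepts the variable-width lookbehind that re rejects, so a first name at the start of a line (the last line of the new input) is handled too: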

import regex
string = """Russ Middleton and Lisa Murro
Ron Iervolino, Trish and Russ Middleton, and Lisa Middleton 
Ron Iervolino, Kelly  and Tom Murro
Ron Iervolino, Trish and Russ Middleton and Lisa Middleton 
Trish and Russ Middleton"""
m = regex.sub(r'(?<!\b[A-Z]\w+\s)([A-Z]\w+)(?=\s+and\s+[A-Z]\w+\s+([A-Z]\w+))', r'\1 \2', string)
print(m)

Output:

Russ Middleton and Lisa Murro
Ron Iervolino, Trish Middleton and Russ Middleton, and Lisa Middleton 
Ron Iervolino, Kelly Murro  and Tom Murro
Ron Iervolino, Trish Middleton and Russ Middleton and Lisa Middleton 
Trish Middleton and Russ Middleton
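If installing the regex module is not an option, one way to cover the start-of-line case with the standard re module is sketched below. This is an assumption of mine rather than part of the answer: it replaces the variable-width lookbehind with "start of line, or directly after a comma and one whitespace character", which is enough for the sample data but is not equivalent in general.

```python
import re

# The standard re module requires fixed-width lookbehinds, so the pattern
# from the regex-module variant is expected to be rejected here.
try:
    re.compile(r'(?<!\b[A-Z]\w+\s)[A-Z]\w+')
    stdlib_accepts_lookbehind = True
except re.error:
    stdlib_accepts_lookbehind = False

# Assumed alternative: a first name at the start of a line or right after ", ".
pattern = re.compile(
    r'(?:^|(?<=,\s))([A-Z]\w+)(?=\s+and\s+[A-Z]\w+\s+([A-Z]\w+))',
    re.MULTILINE)

text = "Trish and Russ Middleton\nRon Iervolino, Kelly  and Tom Murro"
filled = pattern.sub(r'\1 \2', text)
print(filled)
```

Group 2 sits inside the lookahead, so the surname is captured without being consumed and can still be referenced as \2 in the replacement.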

Answer 2: (score: 0)

Alex: I see what you mean about the groups. That had not occurred to me. Thanks!

I have taken a new(ish) approach. It seems to work. Any thoughts?

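import re
# Python 2.7 syntax; under Python 3, write r'...' instead of ur'...'
p = re.compile(ur'([A-Z]\w+\s+[A-Z]\w+)|([A-Z]\w+)(?=\s+and\s+[A-Z]\w+\s+([A-Z]\w+))', re.MULTILINE)
# s is one input string, such as an item of test_data below
temp_result = p.findall(s)  # list of 3-tuples; unmatched groups are ''
joiner = " ".join
out = [joiner(words).strip() for words in temp_result]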

Here is some input data:

test_data = ['John Smith, Barri Lieberman, Nancy Drew','Carter Bays and Craig Thomas','John Smith and Carter Bays',
                     'Jena Silverman, John Silverman, Tess Silverman, and Dara Silverman', 'Tess and Dara Silverman',
                     'Nancy Drew, John Smith, and Daniel Murphy', 'Jonny Podell']

I put the code above in a function so that I can call it on each item in a list. Calling it on the list above, I get this output (from the function):

['John Smith', 'Barri Lieberman', 'Nancy Drew']
['Carter Bays', 'Craig Thomas']
['John Smith', 'Carter Bays']
['Jena Silverman', 'John Silverman', 'Tess Silverman', 'Dara Silverman']
['Tess Silverman', 'Dara Silverman']
['Nancy Drew', 'John Smith', 'Daniel Murphy']
['Jonny Podell']
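The function itself is not shown in the answer. A minimal Python 3 sketch of what it might look like follows; the name extract_names is hypothetical, and the test data is a subset of the list above.

```python
import re

NAME_PATTERN = re.compile(
    r'([A-Z]\w+\s+[A-Z]\w+)|([A-Z]\w+)(?=\s+and\s+[A-Z]\w+\s+([A-Z]\w+))')

def extract_names(s):
    """Return the full names found in s, filling in shared surnames."""
    # findall yields one 3-tuple per match; groups that did not match are ''.
    # Joining with spaces and stripping leaves "First Last" in both cases.
    return [" ".join(groups).strip() for groups in NAME_PATTERN.findall(s)]

test_data = ['Carter Bays and Craig Thomas',
             'Tess and Dara Silverman',
             'Jonny Podell']

for item in test_data:
    print(extract_names(item))
```

Because findall returns empty strings rather than None for groups that did not match, no special handling is needed before joining.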