Question

I have the following string:

string = "Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Michael Pena (Frank Garcia)"

I want to create a list of tuples in the form [(actor_name, character_name), ...], like so:

[(Will Ferrell, Nick Halsey), (Rebecca Hall, Samantha), (Michael Pena, Frank Garcia)]

I am currently doing this in a hack-ish way, splitting on the ( mark and then using .rstrip(')'), like so:

for item in string.split(','):
    item.rstrip(')').split('(')

Is there a better, more robust way to do this? Thanks.

Answer 1

string = "Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Michael Pena (Frank Garcia)"

import re
pat = re.compile(r'([^(]+)\s*\(([^)]+)\)\s*(?:,\s*|$)')

lst = [(t[0].strip(), t[1].strip()) for t in pat.findall(string)]

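The compiled pattern is a bit tricky. It is a raw string, to make the backslashes less insane. What it means is: start a match group; match anything that is not a '(' character, any number of times as long as it is at least once; close the match group; match any white space (including none); match a literal '(' character; start another match group; match anything that is not a ')' character, any number of times as long as it is at least once; close the match group; match a literal ')' character; then match any white space (including none); then something really tricky. The really tricky part is a grouping that does not form a match group. Instead of starting with '(' and ending with ')', it starts with "(?:" and then again ends with ')'. I used this grouping so I could put in a vertical bar to allow two alternate patterns: either a comma followed by any amount of white space, or else the end of the line (the '$' character).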

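Then I used pat.findall() to find all the places within string where the pattern matches; it automatically returns tuples. I put that in a list comprehension and called .strip() on each item to clean off the white space.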
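To make the role of the non-capturing group concrete, here is a small sketch that compares the pattern above with a variant whose trailing group is an ordinary capturing group. The names capturing_pat, raw, and with_sep are made up for this illustration and are not part of the original answer.

```python
import re

string = "Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Michael Pena (Frank Garcia)"

# The pattern from the answer: the trailing group is non-capturing.
pat = re.compile(r'([^(]+)\s*\(([^)]+)\)\s*(?:,\s*|$)')
# Hypothetical variant for comparison: the trailing group captures.
capturing_pat = re.compile(r'([^(]+)\s*\(([^)]+)\)\s*(,\s*|$)')

raw = pat.findall(string)                 # 2-tuples; actor names keep the space before '('
with_sep = capturing_pat.findall(string)  # 3-tuples; the separator is a third item

print(raw)
print(with_sep)
```

The trailing space left on each actor name in raw is why the list comprehension calls .strip(), and the extra third item in with_sep is what "(?:" avoids.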

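Of course, we could make the regular expression even more complicated and have it return names that already have the white space stripped off. The regular expression gets really hairy, though, so we will use one of the coolest features in Python regular expressions: "verbose" mode, where you can spread a pattern across many lines and add comments as you like. We use a raw triple-quoted string, so the backslashes are convenient and the multiple lines are convenient. Here you go: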

import re
s_pat = r'''
\s*  # any amount of white space
([^( \t]  # start match group; match one char that is not a '(' or space or tab
[^(]*  # match any number of non '(' characters
[^( \t])  # match one char that is not a '(' or space or tab; close match group
\s*  # any amount of white space
\(  # match an actual required '(' char (not in any match group)
\s*  # any amount of white space
([^) \t]  # start match group; match one char that is not a ')' or space or tab
[^)]*  # match any number of non ')' characters
[^) \t])  # match one char that is not a ')' or space or tab; close match group
\s*  # any amount of white space
\) # match an actual required ')' char (not in any match group)
\s*  # any amount of white space
(?:,|$)  # non-match group: either a comma or the end of a line
'''
pat = re.compile(s_pat, re.VERBOSE)

lst = pat.findall(string)

Man, that really wasn't worth the effort.
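As a quick check of what the verbose pattern buys, the sketch below runs a condensed copy of it on input with stray spaces. The variable names padded, pairs, and short are assumptions for this example. Note that in verbose mode white space inside a character class such as [^( \t] is still significant, which is what lets the pattern exclude leading and trailing spaces.

```python
import re

# Same verbose pattern as above, with shorter comments.
s_pat = r'''
\s*
([^( \t] [^(]* [^( \t])   # actor: first and last chars are not space, tab, or '('
\s* \( \s*
([^) \t] [^)]* [^) \t])   # character: first and last chars are not space, tab, or ')'
\s* \) \s*
(?:,|$)                   # a comma or the end of the string
'''
pat = re.compile(s_pat, re.VERBOSE)

padded = "  Will Ferrell ( Nick Halsey ) , Rebecca Hall (Samantha)"
pairs = pat.findall(padded)   # names come back without surrounding white space
print(pairs)

# Each group needs at least two characters, so one-letter names do not match.
short = pat.findall("X (Y)")
print(short)
```

The two-character minimum is a side effect of matching a separate first and last character in each group, so it is worth keeping in mind if very short names are possible.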

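Also, the above preserves the white space inside the names. You can easily normalize the white space, to make sure it is 100% consistent, by splitting on white space and re-joining with single spaces.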

string = '  Will   Ferrell  ( Nick\tHalsey ) , Rebecca Hall (Samantha), Michael\fPena (Frank Garcia)'

import re
pat = re.compile(r'([^(]+)\s*\(([^)]+)\)\s*(?:,\s*|$)')

def nws(s):
    """normalize white space.  Replaces all runs of white space by a single space."""
    return " ".join(w for w in s.split())

lst = [tuple(nws(item) for item in t) for t in pat.findall(string)]

print(lst)  # prints: [('Will Ferrell', 'Nick Halsey'), ('Rebecca Hall', 'Samantha'), ('Michael Pena', 'Frank Garcia')]

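Now string has silly white space: multiple spaces, a tab, and even a form feed ("\f"). The code above cleans it up so that the words in each name are separated by a single space.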

Answer 2

This is a nice place for a regular expression:

>>> import re
>>> pat = r"([^,\(]*)\((.*?)\)"
>>> re.findall(pat, "Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Michael Pena (Frank Garcia)")
[('Will Ferrell ', 'Nick Halsey'), (' Rebecca Hall ', 'Samantha'), (' Michael Pena ', 'Frank Garcia')]
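The output above still carries the spaces around each actor name. One possible way to tidy that up, sketched here with assumed variable names text and pairs, is to strip each captured item after matching:

```python
import re

text = "Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Michael Pena (Frank Garcia)"
pat = r"([^,(]*)\((.*?)\)"

# Strip the white space that the first group picks up around each actor name.
pairs = [(actor.strip(), character.strip()) for actor, character in re.findall(pat, text)]
print(pairs)
```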

Answer 3

A more explicit answer than the others, and I think it fits your needs:

import re
regex = re.compile(r'([a-zA-Z]+ [a-zA-Z]+) \(([a-zA-Z]+ [a-zA-Z]+)\)')
actor_character = regex.findall(string)

I admit it is a bit ugly, but like I said, it is more explicit.
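One caveat: as written, the pattern expects exactly two words on each side, so a one-word character name such as "Samantha" is skipped. The sketch below shows that behavior next to a relaxed variant that accepts one or more words. The names original, relaxed, strict_pairs, and relaxed_pairs are illustrative, and the relaxed pattern is only one possible adjustment, not the answer author's.

```python
import re

string = "Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Michael Pena (Frank Garcia)"

# Pattern from the answer: exactly two words for the actor and for the character.
original = re.compile(r'([a-zA-Z]+ [a-zA-Z]+) \(([a-zA-Z]+ [a-zA-Z]+)\)')
# Hypothetical variant: one or more space-separated words on each side.
relaxed = re.compile(r'([a-zA-Z]+(?: [a-zA-Z]+)*) \(([a-zA-Z]+(?: [a-zA-Z]+)*)\)')

strict_pairs = original.findall(string)   # misses the "Samantha" entry
relaxed_pairs = relaxed.findall(string)   # picks up all three entries

print(strict_pairs)
print(relaxed_pairs)
```

Both patterns still only accept ASCII letters, so names with hyphens, apostrophes, or accented characters would need a wider character class.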

