
Time: 2011-08-10 12:20:54

Tags: python regex

This is a follow-up to, and a more complicated version of, this question: Extracting contents of a string within parentheses

In that question, I had the following string -

"Will Farrell (Nick Hasley), Rebecca Hall (Samantha)"

and I wanted to get a list of tuples in the form (actor, character) -

[('Will Farrell', 'Nick Hasley'), ('Rebecca Hall', 'Samantha')]

To generalize the problem, I have a slightly more complicated string from which I need to extract the same information. The string is -

"Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary), 
with Stephen Root and Laura Dern (Delilah)"


I need to turn it into:

[('Will Ferrell', 'Nick Halsey'), ('Rebecca Hall', 'Samantha'), ('Glenn Howerton', 'Gary'),
('Stephen Root', ''), ('Laura Dern', 'Delilah')]

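I know I can replace the filler words (with, and, &, etc.), but I can't quite figure out how to add a blank entry - '' - when an actor has no character name (in this case Stephen Root). What is the best way to do this?

Finally, I also need to handle an actor with multiple roles and build one tuple per role. The final string is:
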
"Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with
 Stephen Root and Laura Dern (Delilah, Stacy)"


The list of tuples I need from it is:

[('Will Ferrell', 'Nick Halsey'), ('Rebecca Hall', 'Samantha'), ('Glenn Howerton', 'Gary'),
 ('Glenn Howerton', 'Brad'), ('Stephen Root', ''), ('Laura Dern', 'Delilah'), ('Laura Dern', 'Stacy')]


4 answers:

Answer 0: (score: 4)

import re
credits = """Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with
 Stephen Root and Laura Dern (Delilah, Stacy)"""

# split on commas (only if outside of parentheses), "with" or "and"
splitre = re.compile(r"\s*(?:,(?![^()]*\))|\bwith\b|\band\b)\s*")

# match the part before the parentheses (1) and what's inside the parens (2)
# (only if parentheses are present)
matchre = re.compile(r"([^(]*)(?:\(([^)]*)\))?")

# split the parts inside the parentheses on commas
splitparts = re.compile(r"\s*,\s*")

characters = splitre.split(credits)
pairs = []
for character in characters:
    if character:
        match = matchre.match(character)
        if match:
            actor = match.group(1).strip()
            if match.group(2):
                parts = splitparts.split(match.group(2))
                for part in parts:
                    pairs.append((actor, part))
            else:
                # no parentheses: the actor has no listed character
                pairs.append((actor, ""))

print(pairs)

Result:

[('Will Ferrell', 'Nick Halsey'), ('Rebecca Hall', 'Samantha'), 
 ('Glenn Howerton', 'Gary'), ('Glenn Howerton', 'Brad'), ('Stephen Root', ''), 
 ('Laura Dern', 'Delilah'), ('Laura Dern', 'Stacy')]
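
To see why the split pattern above leaves the commas inside parentheses alone, here is a minimal sketch on a shorter string. The lookahead `(?![^()]*\))` rejects a comma whenever the next parenthesis character to its right is a closing one, which only happens inside a pair of parentheses. The names `outside_comma` and `sample` are made up for this illustration, and the sketch assumes the parentheses are balanced and not nested.

```python
import re

# A comma is treated as a separator only when it is NOT followed by
# a run of non-parenthesis characters and then ")".
outside_comma = re.compile(r"\s*,(?![^()]*\))\s*")

sample = "Glenn Howerton (Gary, Brad), Laura Dern (Delilah, Stacy)"
pieces = outside_comma.split(sample)
print(pieces)
# ['Glenn Howerton (Gary, Brad)', 'Laura Dern (Delilah, Stacy)']
```

Only the comma between the two credits is used as a split point; the commas separating the roles survive and can be split in a second pass.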

Answer 1: (score: 1)

Tim Pietzcker's solution can be simplified to (note that the patterns are also modified):

import re
credits = """   Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with
 Stephen Root and Laura Dern (Delilah, Stacy)"""

# split on commas (only if outside of parentheses), "with" or "and"
splitre = re.compile(r"(?:,(?![^()]*\))(?:\s*with)*|\bwith\b|\band\b)\s*")

# match the part before the parentheses (1) and what's inside the parens (2)
# (only if parentheses are present)
matchre = re.compile(r"\s*([^(]*)(?<! )\s*(?:\(([^)]*)\))?")

# split the parts inside the parentheses on commas
splitparts = re.compile(r"\s*,\s*")

pairs = []
for character in splitre.split(credits):
    gr = matchre.match(character).groups('')
    for part in splitparts.split(gr[1]):
        pairs.append((gr[0], part))



Or, with a generator expression and a list comprehension instead of the explicit loops:

import re
credits = """   Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with
 Stephen Root and Laura Dern (Delilah, Stacy)"""

# split on commas (only if outside of parentheses), "with" or "and"
splitre = re.compile(r"(?:,(?![^()]*\))(?:\s*with)*|\bwith\b|\band\b)\s*")

# match the part before the parentheses (1) and what's inside the parens (2)
# (only if parentheses are present)
matchre = re.compile(r"\s*([^(]*)(?<! )\s*(?:\(([^)]*)\))?")

# split the parts inside the parentheses on commas
splitparts = re.compile(r"\s*,\s*")

gen = (matchre.match(character).groups('') for character in splitre.split(credits))

pp = [ (gr[0], part) for gr in gen for part in splitparts.split(gr[1])]

print(pp)
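
The simplification relies on `groups('')`, which substitutes the given default for any group that did not take part in the match, so an actor without parentheses gets `''` instead of `None`. A small sketch of that behaviour, reusing the same pattern; the variable names `pattern`, `with_role` and `without_role` are only for illustration:

```python
import re

# Same shape as the pattern above: a name, then an optional parenthesised part.
pattern = re.compile(r"\s*([^(]*)(?<! )\s*(?:\(([^)]*)\))?")

with_role = pattern.match("Laura Dern (Delilah, Stacy)").groups('')
without_role = pattern.match("Stephen Root ").groups('')

print(with_role)     # ('Laura Dern', 'Delilah, Stacy')
print(without_role)  # ('Stephen Root', '')
```

Because the second element is always a string, it can be passed straight to `splitparts.split`, and splitting `''` yields `['']`, which is what produces the blank character entry.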



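Answer 2: (score: 0)

What you want is to identify sequences of words that start with a capital letter, plus some complications (IMHO you cannot assume that every name is made up of Name Surname; there is also Name Surname Jr., or Name M. Surname, or other localized variations such as Jean-Claude van Damme, Louis da Silva, etc.).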



import nltk
from nltk.chunk.regexp import RegexpParser

_patterns = [
    (r'^[A-Z][a-zA-Z]*[A-Z]?[a-zA-Z]+.?$', 'NNP'),  # proper nouns
    (r'^[(]$', 'O'),                                # opening parenthesis
    (r'[,]', 'COMMA'),
    (r'^[)]$', 'C'),                                # closing parenthesis
    (r'.+', 'NN')                                   # nouns (default)
]

_grammar = """
        NAME: {<NNP> <COMMA> <NNP>}
        NAME: {<NNP>+}
        ROLE: {<O> <NAME>+ <C>}
"""
text = "Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with Stephen Root and Laura Dern (Delilah, Stacy)"
tagger = nltk.RegexpTagger(_patterns)
chunker = RegexpParser(_grammar)
# pad parentheses and commas with spaces so that split() yields them as separate tokens
text = text.replace('(', '( ').replace(')', ' )').replace(',', ' , ')
tokens = text.split()
tagged_text = tagger.tag(tokens)
tree = chunker.parse(tagged_text)

for n in tree:
    # Tree.label() is the NLTK 3 spelling; older releases exposed this as n.node
    if isinstance(n, nltk.tree.Tree) and n.label() in ['ROLE', 'NAME']:
        print(n)

# output is:
# (NAME Will/NNP Ferrell/NNP)
# (ROLE (/O (NAME Nick/NNP Halsey/NNP) )/C)
# (NAME Rebecca/NNP Hall/NNP)
# (ROLE (/O (NAME Samantha/NNP) )/C)
# (NAME Glenn/NNP Howerton/NNP)
# (ROLE (/O (NAME Gary/NNP ,/COMMA Brad/NNP) )/C)
# (NAME Stephen/NNP Root/NNP)
# (NAME Laura/NNP Dern/NNP)
# (ROLE (/O (NAME Delilah/NNP ,/COMMA Stacy/NNP) )/C)




Otherwise, Tim's solution does the job nicely for the input you posted, and has no dependency on nltk.

Answer 3: (score: 0)


in_string = "Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with Stephen Root and Laura Dern (Delilah, Stacy)"    

in_list = []
is_in_paren = False
item = {}
next_string = ''

index = 0
while index < len(in_string):
    char = in_string[index]  

    if in_string[index:].startswith(' and') and not is_in_paren:
        # " and" outside parentheses ends an actor who has no listed part
        actor = next_string
        if actor.startswith(' with '):
            actor = actor[6:]
        item['actor'] = actor
        in_list.append(item)
        item = {}
        next_string = ''
        index += 4
    elif char == '(':
        is_in_paren = True
        item['actor'] = next_string
        next_string = ''
    elif char == ')':
        is_in_paren = False
        item['part'] = next_string
        in_list.append(item)
        item = {}
        next_string = ''
    elif char == ',':
        if is_in_paren:
            # a comma inside parentheses ends one part; keep the same actor for the next
            item['part'] = next_string
            next_string = ''
            in_list.append(item)
            item = item.copy()
    else:
        next_string = "%s%s" % (next_string, char)

    index += 1

out_list = []
for entry in in_list:  # renamed from "dict" to avoid shadowing the built-in
    actor = entry.get('actor')
    part = entry.get('part')

    if part is None:
        part = ''

    out_list.append((actor.strip(), part.strip()))

print(out_list)

Output: [('Will Ferrell', 'Nick Halsey'), ('Rebecca Hall', 'Samantha'), ('Glenn Howerton', 'Gary'), ('Glenn Howerton', 'Brad'), ('Stephen Root', ''), ('Laura Dern', 'Delilah'), ('Laura Dern', 'Stacy')]