What regex to use in this example

Time: 2014-08-31 00:30:17

Tags: python regex

I am parsing a string that I know for certain contains only the following distinct phrases:

'Man of the Match'
'Goal'
'Assist'
'Yellow Card'
'Red Card'

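The string I am parsing can contain anything from none of the elements above to all of them (i.e. the string being parsed could be anything from 'None' to 'Man of the Match Goal Assist Yellow Card Red Card').

For those who know football, you will also realise that the 'Goal' and 'Assist' elements can in theory repeat an unlimited number of times. The 'Yellow Card' element can also occur 0, 1 or 2 times.

I have built the following regexes (where 'incident1' is the string being parsed), which I believed would return any number of occurrences of each phrase, but all I get is a single instance: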

regex1 = re.compile("Man of the Match*", re.S)
regex2 = re.compile("Goal*", re.S)
regex3 = re.compile("Assist*", re.S)
regex4 = re.compile("Red Card*", re.S)
regex5 = re.compile("Yellow Card*", re.S)

mysearch1 = re.search(regex1, incident1)
mysearch2 = re.search(regex2, incident1)
mysearch3 = re.search(regex3, incident1)
mysearch4 = re.search(regex4, incident1)
mysearch5 = re.search(regex5, incident1)

#print mystring
print "incident1 = ", incident1
if mysearch1 is not None:
    print "Man of the match = ", mysearch1.group()
if mysearch2 is not None:    
    print "Goal = ", mysearch2.group()
if mysearch3 is not None:
    print "Assist = ", mysearch3.group()
if mysearch4 is not None:    
    print "Red Card = ", mysearch4.group()
if mysearch5 is not None:
    print "Yellow Card = ", mysearch5.group()

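This works as long as each element occurs only once in the string, but if a player scores more than one goal, for example, this code still returns only one 'Goal' instance.

Can anyone see what I am doing wrong?

2 answers:

Answer 0: (score: 2)

You could try something like this: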

import re
s = "here's an example Man of the Match match and a Red Card match, and another Red Card match"
patterns = [
    'Man of the Match',
    'Goal',
    'Assist',
    'Yellow Card',
    'Red Card',
]
repattern = '|'.join(patterns)
matches = re.findall(repattern, s, re.IGNORECASE)
print(matches)  # ['Man of the Match', 'Red Card', 'Red Card']
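
For what it's worth, here is a small sketch of why the patterns in the question only ever report one instance, and how the findall approach can be turned into per-event counts. The sample value of incident1 and the use of collections.Counter are assumptions made for illustration; neither comes from the question.

```python
import re
from collections import Counter

# Hypothetical sample input; the real incident1 comes from the asker's data.
incident1 = "Man of the Match Goal Goal Assist Yellow Card"

# "Goal*" means "Goa" followed by zero or more "l" characters: the * binds
# only to the single preceding character, not to the whole word.
# re.search also stops at the first match, so only one instance comes back.
first_only = re.search("Goal*", incident1).group()

# findall with an alternation returns every non-overlapping match.
events = ['Man of the Match', 'Goal', 'Assist', 'Yellow Card', 'Red Card']
pattern = '|'.join(map(re.escape, events))
counts = Counter(re.findall(pattern, incident1))

print(first_only)                          # Goal
print(counts['Goal'], counts['Red Card'])  # 2 0
```

A Counter returns 0 for phrases that never occurred, so no "is not None" checks are needed.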

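Answer 1: (score: 1)

A general overview of the regex functions in Python:

re.search | re.match

In your earlier attempt you tried to use re.search. It returned only one result, and as you'll see, that is expected. These two functions are used to check whether a line contains a certain regex. You might use them like this: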

import re
import subprocess

# runs ipconfig (a Windows command) and captures its output as text
s = subprocess.check_output('ipconfig', universal_newlines=True)
for line in s.splitlines():
    if re.search(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", line):
        # the line contains something that looks like an IPv4 address
        print(line)

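Use re.match specifically to check whether a regex matches at the BEGINNING of a string. It is usually used with a regex that describes the WHOLE string. For example: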

lines = ['Adam Smith, Age: 24, Male, Favorite Thing: Reading page: 16',
         'Adam Smith, Age: 16, Male, Favorite Thing: Being a regex example']
# two Adams, but we only want the one who is 16 years old.
repattern = re.compile(r'''Adam \w+, Age: 16, (?:Male|Female), Favorite Thing: [^,]*?''')
for line in lines:
    if repattern.match(line):
        print(line)
        # Adam Smith, Age: 16, Male, Favorite Thing: Being a regex example
# note: a case-insensitive re.search for "age: 16" would have found both lines
#   (the first one contains "page: 16")!

The takeaway is that you use these two functions to select lines from a longer document (or from any iterable of lines).
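
A minimal sketch of the difference, reusing one of the sample lines from above. It also shows re.fullmatch, which is only available in Python 3.4 and later and is mentioned here as an aside rather than something the question needs.

```python
import re

line = "Adam Smith, Age: 24, Male, Favorite Thing: Reading page: 16"

# re.search scans the whole string for the first place the pattern matches.
found_anywhere = re.search(r"page: \d+", line) is not None

# re.match only tries at position 0, so the same pattern fails here.
found_at_start = re.match(r"page: \d+", line) is not None

# re.match does not require the pattern to consume the whole string...
prefix_ok = re.match(r"Adam \w+", line) is not None

# ...whereas re.fullmatch (Python 3.4+) does.
whole_ok = re.fullmatch(r"Adam \w+", line) is not None

print(found_anywhere, found_at_start, prefix_ok, whole_ok)
# True False True False
```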

re.findall | re.finditer

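In this case, it seems you aren't trying to match a line; you're trying to extract some specifically formatted pieces of information from a string. Let's look at some examples.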

s = """Phone book:
Adam: (555)123-4567
Joe:  (555)987-6543
Alice:(555)135-7924"""

pat = r'''(?:\(\d{3}\))?\d{3}-?\d{4}'''
phone_numbers = re.findall(pat, s)
print(phone_numbers)
# ['(555)123-4567', '(555)987-6543', '(555)135-7924']

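re.finditer returns an iterator of match objects instead of a list. You would use it for the same reason you'd use xrange instead of range in Python 2: if there are a TON of matches, re.findall(some_pattern, some_string) can build a GIANT list, whereas re.finditer won't.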
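
Applied to the question, a rough sketch might look like the following. The sample string is made up for illustration; the point is only that each item produced by finditer is a match object that knows both what it matched and where.

```python
import re

# Hypothetical sample input, not taken from the question.
incident1 = "Goal Assist Goal Yellow Card"
pattern = re.compile(r"Man of the Match|Goal|Assist|Yellow Card|Red Card")

# finditer yields match objects lazily, one per non-overlapping match.
found = [(m.group(), m.start()) for m in pattern.finditer(incident1)]
print(found)
# [('Goal', 0), ('Assist', 5), ('Goal', 12), ('Yellow Card', 17)]
```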

Other methods: re.split | re.sub

re.split is great if you have lots of different things to split on. Imagine you have this string:

s = '''Hello, world! It's great that you're talking to me, and everything, but I'd really rather you just split me on punctuation marks. Okay?'''

There's no good way to do this with the str.split you're used to, so do this instead:

separators = [".", "!", "?", ","]
splitpattern = '|'.join(map(re.escape, separators))
# re.escape takes a string and escapes out any characters that regex considers
#   special, for instance that . would otherwise be "any character"!
split_s = re.split(splitpattern, s)
print(split_s)
# ['Hello', ' world', " It's great that you're talking to me", ' and everything', " but I'd really rather you just split me on punctuation marks", ' Okay', '']
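
Two follow-ups that often come up with re.split, sketched on a shorter sample string of my own: dropping the empty string left by a trailing separator, and keeping the separators by wrapping the pattern in a capturing group.

```python
import re

s = "Hello, world! Okay?"

# Plain split: separators are dropped, and the trailing "?" leaves an empty string.
plain = re.split(r"[.!?,]", s)

# Strip whitespace and drop empty pieces.
cleaned = [part.strip() for part in plain if part.strip()]

# With a capturing group, the separators are kept in the result.
with_seps = re.split(r"([.!?,])", s)

print(plain)      # ['Hello', ' world', ' Okay', '']
print(cleaned)    # ['Hello', 'world', 'Okay']
print(with_seps)  # ['Hello', ',', ' world', '!', ' Okay', '?', '']
```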

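re.sub is great when you know something will be formatted in a regular way, but you're not sure exactly how, and you really want everything to end up in the same format! This example is a bit advanced and combines several methods, but stick with me....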

dates = ['08/08/2014', '09-13-2014', '10.10.1997', '9_29_09']
separators = list()
new_sep = "/"
match_pat = re.compile(r'''
    \d{1,2}              # one or two digits
    (.)                  # followed by a separator (captured)
    \d{1,2}              # one or two more digits
    \1                   # a backreference to that same separator
    \d{2}(?:\d{2})?      # a two-digit or four-digit year''', re.X)
valid_dates = []  # collect matches here rather than popping from dates while iterating over it
for date in dates:
    match = match_pat.match(date)
    if match:
        separators.append(match.group(1))  # the separator
        valid_dates.append(date)
    # otherwise this isn't really a date, is it? skip it
repl_pat = '|'.join(map(re.escape, separators))
final_dates = re.sub(repl_pat, new_sep, '\n'.join(valid_dates))
print(final_dates)
# 08/08/2014
# 09/13/2014
# 10/10/1997
# 9/29/09
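
As an alternative sketch, the same normalisation can be done in one pass by capturing the pieces of each date and rebuilding it in the replacement string. Note two assumptions of mine that differ from the code above: the separator is restricted to a single non-digit (\D instead of .), and the pattern is anchored at both ends. The 'not a date' entry is added only to show the filtering.

```python
import re

dates = ['08/08/2014', '09-13-2014', '10.10.1997', '9_29_09', 'not a date']

date_pat = re.compile(r'''
    ^(\d{1,2})           # group 1: one or two digits
    (\D)                 # group 2: a single non-digit separator
    (\d{1,2})            # group 3: one or two digits
    \2                   # the same separator again
    (\d{2}(?:\d{2})?)$   # group 4: two-digit or four-digit year
    ''', re.X)

# Keep only strings that look like dates and rebuild them with "/" separators.
normalized = [date_pat.sub(r'\1/\3/\4', d) for d in dates if date_pat.match(d)]
print(normalized)
# ['08/08/2014', '09/13/2014', '10/10/1997', '9/29/09']
```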

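A slightly less advanced example: instead of a replacement string, you can pass re.sub a function, and whatever that function returns is used as the replacement! For example: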

def get_department(dept_num):
    departments = {'1': 'I.T.',
                   '2': 'Administration',
                   '3': 'Human Resources',
                   '4': 'Maintenance'}
    if hasattr(dept_num, 'group'): # then it's a match, not a number
        dept_num = dept_num.group(0)
    return departments.get(dept_num, "Unknown Dept")

file = r"""Name,Performance Review,Department
Adam,3,1
Joe,5,2
Alice,1,3
Eve,12,4""" # this looks like a csv file

dept_names = re.sub(r'''\d+$''', get_department, file, flags=re.M)
print(dept_names)
# Name,Performance Review,Department
# Adam,3,I.T.
# Joe,5,Administration
# Alice,1,Human Resources
# Eve,12,Maintenance

Without regex, you could do it like this:

replaced_lines = []
departments = {'1': 'I.T.',
               '2': 'Administration',
               '3': 'Human Resources',
               '4': 'Maintenance'}
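for line in file.splitlines():
    the_split_line = line.split(',')
    last_field = the_split_line[-1]
    if last_field.isdigit():  # leave the header row alone, as \d+$ does
        last_field = departments.get(last_field, "Unknown Dept")
    replaced_lines.append(','.join(the_split_line[:-1] + [last_field]))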
new_file = '\n'.join(replaced_lines)
# LOTS OF STRING MANIPULATION, YUCK!

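Instead, we replace all of that for-looping, string splitting, list slicing and string manipulation with a function and a single re.sub call. In fact, if you use a lambda it's even easier!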

departments = {'1': 'I.T.',
               '2': 'Administration',
               '3': 'Human Resources',
               '4': 'Maintenance'}
# the lambda receives a match object, so look up the matched text via .group(0)
re.sub(r'''\d+$''', lambda m: departments.get(m.group(0), "Unknown Dept"), file, flags=re.M)
# DONE!
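
One detail worth spelling out: the callable given to re.sub is called with a match object, not with the matched string. A small sketch on made-up rows (the department number 9 is deliberately missing from the dictionary to show the fallback):

```python
import re

departments = {'1': 'I.T.', '2': 'Administration'}
# Hypothetical sample data for illustration.
rows = "Adam,3,1\nJoe,5,2\nEve,12,9"

# The callable passed to re.sub receives a match object, so the matched
# text has to be pulled out with .group() before the dictionary lookup.
result = re.sub(r'\d+$',
                lambda m: departments.get(m.group(0), "Unknown Dept"),
                rows, flags=re.M)
print(result)
# Adam,3,I.T.
# Joe,5,Administration
# Eve,12,Unknown Dept
```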