Python RegEx - How to handle optional parts in a string

Time: 2015-05-30 07:40:10

Tags: python regex pager

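This is the source code I currently use to parse messages from a fire department pager with regular expressions. Everything works except the pAddress line.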

import re

sInput = '(CUPE123, CUPE124, MTVW211, MTVW215, SUNV5326) ALARM-STRUC (Alarm Type THERMAL SMOKE) (Box 12345) APPLE INC - 1 INFINITE LOOP CUPERTINO. (XStr DE ANZA BLVD/MARIANI AVE) .BUILDING FIRE - SMOKE SHOWING - PERSONS REPORTED. #F987654321'

# Matches truck names using the consistent four uppercase letters followed by three to four digits.
pTrucks = ','.join(re.findall(r'\w[A-Z]{3}\d[0-9]{2,3}', sInput))

# Matches source and job type using the - as a guide. This section is always preceded by the trucks on the job,
# and is therefore always preceded by a ) and a space. Allows 2-8 characters on either side of the - to
# allow variations such as 911-RESC, FAA-AIRCRAFT etc.
pJobSource = ''.join(re.findall(r'\) ([A-Za-z1-9]{2,8}-[A-Za-z1-9]{2,8})', sInput))

# Gets the address by starting at (but not capturing) the job type, e.g. -RESC, and capturing everything up to the
# next period; the address section always ends with a period. The intent is to use (?:...) non-capturing groups to
# skip up to two bracketed sections that may appear in the string for things such as box numbers or alarm types.
pAddress = ''.join(re.findall(r'-[A-Z1-9]{2,8} (.*?)\. \(', sInput))

# Finds the specified cross streets, as they are always within () brackets; each bracket has a space immediately
# before or after it, and the word XStr is always present.
pCrossStreet = ''.join(re.findall(r' \((XStr.*?)\) ', sInput))

# The job details / description is always contained between two . periods e.g.  .42YOM CARDIAC ARREST.  each period
# has a space either immediately before or after.
pJobDetails = ''.join(re.findall(r' \.(.*?)\. ', sInput))

# Job number is always in the format #F followed by seven digits.  The # is always preceded by a space.  The
# pattern allows between 0 and 7 digits (\d{0,7}), so anything past the seventh digit is not captured.
pJobNumber = ''.join(re.findall(r' (#F\d{0,7})', sInput))

# Gets the optional alarm type, which always starts with a space followed by (Alarm
pAlarmDetails = ''.join(re.findall(r' \((Alarm .*?)\) ', sInput))

# Gets the optional box details, which always start with a space followed by (Box
pBoxDetails = ''.join(re.findall(r' (\(Box .*?\))', sInput))

print("Responding Trucks:  " + pTrucks)
print("Job Source / Type:  " + pJobSource)
print("Address:            " + pAddress)
print("Cross Streets:      " + pCrossStreet)
print("Job Details:        " + pJobDetails)
print("Additional Info:    " + pAlarmDetails + ", " + pBoxDetails)
print("\n\nJob Number:         " + pJobNumber)

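The problem is that the pager input has two optional fields, (Alarm Type *) and (Box *), which depending on the job may be present, absent, or appear in any combination. The current code returns: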

Responding Trucks:  CUPE123,CUPE124,MTVW211,MTVW215,SUNV5326
Job Source / Type:  ALARM-STRUC
Address:            (Alarm Type THERMAL SMOKE) (Box 12345) APPLE INC - 1 INFINITE LOOP CUPERTINO
Cross Streets:      XStr DE ANZA BLVD/MARIANI AVE
Job Details:        BUILDING FIRE - SMOKE SHOWING - PERSONS REPORTED
Additional Info:    Alarm Type THERMAL SMOKE, (Box 12345)


Job Number:         #F9876543

Everything is perfect except the Address line, which also pulls in the alarm type and the box number.

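How can I modify the RegEx so that the (Alarm Type) and (Box) fields are treated as optional? I tried this from another SO thread, and it works perfectly with the current sInput string: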

pAddress = ''.join(re.findall(r'-[A-Z1-9]{2,8}(?: \(Alarm .*?\))(?: \(Box .*\)) (.*?)\. \(', sInput))

This returns:

Responding Trucks:  CUPE123,CUPE124,MTVW211,MTVW215,SUNV5326
Job Source / Type:  ALARM-STRUC
Address:            APPLE INC - 1 INFINITE LOOP CUPERTINO
Cross Streets:      XStr DE ANZA BLVD/MARIANI AVE
Job Details:        BUILDING FIRE - SMOKE SHOWING - PERSONS REPORTED
Additional Info:    Alarm Type THERMAL SMOKE, (Box 12345)


Job Number:         #F9876543

This is perfect and exactly the result I expect. However, when I change the sInput string so that it contains neither (Alarm Type *) nor (Box *):

sInput = '(CUPE123, CUPE124, MTVW211, MTVW215, SUNV5326) ALARM-STRUC APPLE INC - 1 INFINITE LOOP CUPERTINO. (XStr DE ANZA BLVD/MARIANI AVE) .BUILDING FIRE - SMOKE SHOWING - PERSONS REPORTED. #F987654321'

the output then returns nothing in the Address field:

Responding Trucks:  CUPE123,CUPE124,MTVW211,MTVW215,SUNV5326
Job Source / Type:  ALARM-STRUC
Address:            
Cross Streets:      XStr DE ANZA BLVD/MARIANI AVE
Job Details:        BUILDING FIRE - SMOKE SHOWING - PERSONS REPORTED
Additional Info:    , 


Job Number:         #F9876543

I feel like I'm very close and just missing something... Sorry for the long post, it may be a bit TMI.

TL;DR: How do I modify the RegEx for the pAddress variable so that it ignores the (Alarm Type *) and (Box *) fields, whether or not they are present?

1 Answer:

Answer 0: (score: 4)

You just need to add the ? quantifier (zero or one match) to both non-capturing groups.

-[A-Z1-9]{2,8}(?: \(Alarm .*?\))?(?: \(Box .*\))? (.*?)\. \(

Now it should work whether or not the Alarm Type or Box fields are present.
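
As a quick check of that claim, here is a minimal sketch that runs the pattern above against all four combinations of the two optional fields. The helpers build_message and extract_address are hypothetical names introduced only for this illustration; the sample text is the sInput string from the question.

```python
import re

# The pattern from this answer: both bracketed fields sit inside optional
# non-capturing groups, so each may appear zero or one time.
ADDRESS_RE = re.compile(
    r'-[A-Z1-9]{2,8}(?: \(Alarm .*?\))?(?: \(Box .*\))? (.*?)\. \('
)


# Hypothetical helper (not in the original post): rebuilds the question's
# sample message with the optional fields switched on or off.
def build_message(alarm=True, box=True):
    optional = ''
    if alarm:
        optional += ' (Alarm Type THERMAL SMOKE)'
    if box:
        optional += ' (Box 12345)'
    return (
        '(CUPE123, CUPE124, MTVW211, MTVW215, SUNV5326) ALARM-STRUC'
        + optional
        + ' APPLE INC - 1 INFINITE LOOP CUPERTINO.'
        ' (XStr DE ANZA BLVD/MARIANI AVE)'
        ' .BUILDING FIRE - SMOKE SHOWING - PERSONS REPORTED. #F987654321'
    )


# Hypothetical helper: same ''.join(re.findall(...)) idiom as the question.
def extract_address(message):
    return ''.join(ADDRESS_RE.findall(message))


for alarm in (True, False):
    for box in (True, False):
        print(alarm, box, repr(extract_address(build_message(alarm, box))))
```

Each of the four lines printed should show the same address, APPLE INC - 1 INFINITE LOOP CUPERTINO. Without the ? quantifiers both groups are mandatory, which is why the earlier attempt returned an empty Address once the fields were removed.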

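
One possible refinement, offered as a sketch rather than as part of the original answer: the greedy .* inside the Box group only lands on the right closing bracket through backtracking. Assuming a bracketed field never contains a ) of its own, [^)]* can be used instead, and named groups can pull the optional fields out in the same pass. The names PAGER_RE, with_fields and without_fields are made up for this example, and the truck list in the sample is shortened.

```python
import re

# Assumption (not from the original post): a bracketed field never contains
# a ')' of its own, so [^)]* can replace .*? and .* inside the brackets.
PAGER_RE = re.compile(
    r'-[A-Z1-9]{2,8}'
    r'(?: \((?P<alarm>Alarm [^)]*)\))?'   # optional (Alarm ...) field
    r'(?: \((?P<box>Box [^)]*)\))?'       # optional (Box ...) field
    r' (?P<address>.*?)\. \('             # address, up to the ". (" marker
)

# Shortened version of the question's sample message.
with_fields = ('(CUPE123, SUNV5326) ALARM-STRUC (Alarm Type THERMAL SMOKE) (Box 12345) '
               'APPLE INC - 1 INFINITE LOOP CUPERTINO. (XStr DE ANZA BLVD/MARIANI AVE) '
               '.BUILDING FIRE - SMOKE SHOWING - PERSONS REPORTED. #F987654321')
without_fields = with_fields.replace(' (Alarm Type THERMAL SMOKE) (Box 12345)', '')

for message in (with_fields, without_fields):
    match = PAGER_RE.search(message)
    # A group that did not take part in the match comes back as None.
    print(match.group('alarm'), match.group('box'), match.group('address'))
```

With this variant an absent field shows up as None instead of an empty string, so the calling code can tell "missing" apart from "present but empty".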