Python regex - strings

Time: 2015-05-21 04:54:02

Tags: python regex parsing output pager

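I'm trying to teach myself Python, and I'm new to the concept of parsing. I'm trying to parse the output of my fire service pager, which seems to follow a consistent pattern like this: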

(UNIT1, UNIT2, UNIT3) 911-STRU (Box# 12345) aBusiness 12345 Street aTown (Xstr CrossStreet1/CrossStreet2) building fire, persons reported #F123456

Each section seems to be delimited by () brackets, and the fields break down as follows:

(Responded trucks) CallSource-JobClassification (Box number if available) Building Name, Building Address (Cross streets) Description of job #JobNumber

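Scrap that, I got a call while writing this. If no box number is provided, that section is skipped entirely and the message goes straight to the address section, so I can't count on the brackets for parsing.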

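So, to the parsing experts out there: can I attack this with pyparsing, or do I need a custom parser? Also, can I use a parser to target specific sections so that the order in which they appear doesn't matter, as is the case with Box# being an optional field?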

My goal is to take this input, tidy it up through parsing, and then send it out via Twitter, SMS, email, or all of the above.

Many thanks in advance.

Edit:

I've got it 99% working with the following code:

import re

sInput = ('(UNIT123, UNIT1234) AMB-MED APPLE HEADQUARTERS 1 INFINITE LOOP CUPERTINO. (XStr DE ANZA BLVD/MARIANI AVE) .42YOM CARDIAC ARREST. #F9876543')

#sInput = '(UNIT123, UNIT1234) ALARM-SPRNKLR (Alarm Type MANUAL/SMOKE) (Box 12345) APPLE HEADQUARTERS 1 INFINITE LOOP CUPERTINO. (XStr DE ANZA BLVD/MARIANI AVE) #F9876544'

# Matches truck names using the consistent four uppercase letters followed by three - four numbers.
pAppliances = re.findall(r'\w[A-Z]{3}\d[0-9]{2,3}', sInput)

# Matches source and job type using the - as a guide. This section is always preceded by the trucks on the job,
# so it is always preceded by a ) and a space. Allows 2-8 characters either side of the - to
# allow such variations as 911-RESC, FAA-AIRCRAFT etc.
pJobSource = re.findall(r'\) ([A-Za-z0-9]{2,8}-[A-Za-z0-9]{2,8})', sInput)

# Gets address by starting at (but ignoring) the job source e.g. -RESC and capturing everything until the next . period;
# the end of the address section always has a period. pAddressOptionTwo uses (?: ) non-capturing groups to skip
# exactly two sets of brackets that may appear in the string for things such as box numbers or alarm types.

pAddress = re.findall(r'-[A-Z0-9]{2,8} (.*?)\. \(', sInput)
pAddressOptionTwo = re.findall(r'-[A-Z0-9]{2,8}(?: \(.*?\))(?: \(.*?\)) (.*?)\. \(', sInput)

# Finds the specified cross streets as they are always within () brackets, each bracket has a space immediately
# before or after and the word XStr is always present.
pCrossStreet = re.findall(r' \((XStr.*?)\) ', sInput)

# The job details / description is always contained between two . periods e.g.  .42YOM CARDIAC ARREST.  each period
# has a space either immediately before or after.
pJobDetails = re.findall(r' \.(.*?)\. ', sInput)

# Job number is always in the format #F followed by seven digits.  The # is always preceded by a space.  Allows
# between 1 and 8 digits for future proofing.
pJobNumber = re.findall(r' (#F\d{1,8})', sInput)

print(pAppliances)
print(pJobSource)
print(pAddress)
print(pCrossStreet)
print(pJobDetails)
print(pJobNumber)

When run on the uncommented sInput string, it returns the following:

['UNIT123', 'UNIT1234']
['AMB-MED']
['APPLE HEADQUARTERS 1 INFINITE LOOP CUPERTINO']
['XStr DE ANZA BLVD/MARIANI AVE']
['42YOM CARDIAC ARREST']
['#F9876543']

However, when I run it on the commented-out sInput string, I get the following:

['UNIT123', 'UNIT1234']
['ALARM-SPRNKLR']
['(Alarm Type MANUAL/SMOKE) (Box 12345) APPLE HEADQUARTERS 1 INFINITE LOOP CUPERTINO']
['XStr DE ANZA BLVD/MARIANI AVE']
[]
['#F9876544']

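This is because this message contains two optional bracket sets. I managed to correct this using the pAddressOptionTwo line, but when that is applied to the first string it returns no address at all, because it doesn't find the brackets.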

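So the new, refocused question is:

How do I make part of a regex optional? If brackets are present, ignore them and their contents and return the rest of the string; if there are no brackets, carry on as normal.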
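For illustration, one possible way to get this behaviour is to put a `*` quantifier on the non-capturing bracket group, so that it matches zero or more bracket sets instead of exactly two. The sketch below is an assumption about what would fit these two sample messages, not a tested rule for every pager format; the names `ADDRESS`, `plain` and `bracketed` are made up for the example.

```python
import re

# (?: \(.*?\))* skips zero or more " (...)" sections after the job type,
# so the same pattern covers messages with and without bracket sets.
ADDRESS = re.compile(r'-[A-Z0-9]{2,8}(?: \(.*?\))* (.*?)\. \(')

plain = ('(UNIT123, UNIT1234) AMB-MED APPLE HEADQUARTERS 1 INFINITE LOOP CUPERTINO. '
         '(XStr DE ANZA BLVD/MARIANI AVE) .42YOM CARDIAC ARREST. #F9876543')
bracketed = ('(UNIT123, UNIT1234) ALARM-SPRNKLR (Alarm Type MANUAL/SMOKE) (Box 12345) '
             'APPLE HEADQUARTERS 1 INFINITE LOOP CUPERTINO. '
             '(XStr DE ANZA BLVD/MARIANI AVE) #F9876544')

print(ADDRESS.findall(plain))      # ['APPLE HEADQUARTERS 1 INFINITE LOOP CUPERTINO']
print(ADDRESS.findall(bracketed))  # ['APPLE HEADQUARTERS 1 INFINITE LOOP CUPERTINO']
```

Using `?` instead of `*` would make a single bracket set optional; `*` is chosen here because the number of optional sections varies between messages.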

2 answers:

Answer 0: (score: 2)

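I think your best/simplest option is to use regular expressions: define a pattern that matches all or part of the input string, and extract the parts you're interested in.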
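As a rough sketch of what such a pattern could look like for a whole message, the example below uses named groups and `re.VERBOSE` so each field can be commented. The field layout is assumed from the two sample strings in the question (address ending in a period, cross streets starting with XStr, optional description between periods); `PAGE_PATTERN`, `parse_page` and the group names are hypothetical.

```python
import re

PAGE_PATTERN = re.compile(r"""
    \((?P<units>[^)]*)\)\s+                 # responding units, e.g. (UNIT123, UNIT1234)
    (?P<job>[A-Za-z0-9]+-[A-Za-z0-9]+)      # source and job type, e.g. AMB-MED
    (?P<extras>(?:\s+\([^)]*\))*)\s+        # zero or more optional bracketed sections
    (?P<address>.*?)\.\s+                   # address, up to the period that ends it
    \((?P<cross>XStr[^)]*)\)\s+             # cross streets
    (?:\.(?P<details>.*?)\.\s+)?            # optional description between two periods
    (?P<number>\#F\d+)                      # job number
""", re.VERBOSE)


def parse_page(message):
    """Return a dict of fields, or None if the message does not fit the pattern."""
    match = PAGE_PATTERN.match(message)
    if match is None:
        return None
    fields = match.groupdict()
    # Split the unit list and unpack the optional bracketed sections.
    fields["units"] = [unit.strip() for unit in fields["units"].split(",")]
    fields["extras"] = re.findall(r"\(([^)]*)\)", fields["extras"])
    return fields


sample = ('(UNIT123, UNIT1234) ALARM-SPRNKLR (Alarm Type MANUAL/SMOKE) (Box 12345) '
          'APPLE HEADQUARTERS 1 INFINITE LOOP CUPERTINO. '
          '(XStr DE ANZA BLVD/MARIANI AVE) #F9876544')
parsed = parse_page(sample)
print(parsed["address"])  # APPLE HEADQUARTERS 1 INFINITE LOOP CUPERTINO
print(parsed["extras"])   # ['Alarm Type MANUAL/SMOKE', 'Box 12345']
```

Fields that are absent from a message, such as the description in this sample, come back as `None`, which makes it easy to leave them out when formatting the outgoing message.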

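PyParsing would probably work fine too. I haven't used it myself, but the first few examples look like a higher-level wrapper around regular expressions, although I expect it differs in many ways once you dig into it.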

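Another option is to define a lexer and use PLY to create a parser from it. That is probably overkill for your use case, though, as it is aimed more at parsing programming-language and natural-language grammars.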

Answer 1: (score: 0)

If you know pyparsing, it may be easier. The () sections can always be treated as optional. Pyparsing will make some things easier out of the box.

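If you're not familiar with pyparsing and your main goal is to learn Python, then handcraft your own parser in pure Python. There's nothing better for learning a new language than reinventing a few wheels :-)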
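To give an idea of what a handcrafted parser might look like, here is a minimal sketch that uses no regex at all: it first splits the message into bracketed and plain-text sections, then decides what each section is from its content and position. The rules (first bracket set is the unit list, "Box" and "XStr" prefixes, job number after the last #) are assumptions drawn from the sample messages, and `split_sections` and `parse_page` are hypothetical names.

```python
def split_sections(message):
    """Split a page into ('paren', text) and ('text', text) sections, in order."""
    sections, buffer, inside = [], [], False
    for char in message:
        if char == "(" and not inside:
            text = "".join(buffer).strip()
            if text:
                sections.append(("text", text))
            buffer, inside = [], True
        elif char == ")" and inside:
            sections.append(("paren", "".join(buffer).strip()))
            buffer, inside = [], False
        else:
            buffer.append(char)
    text = "".join(buffer).strip()
    if text:
        sections.append(("text", text))
    return sections


def parse_page(message):
    """Classify each section by content and position; absent fields stay None."""
    result = {"units": [], "job": None, "box": None, "extras": [], "address": None,
              "cross_streets": None, "details": None, "job_number": None}
    for index, (kind, text) in enumerate(split_sections(message)):
        if kind == "paren":
            if index == 0:
                result["units"] = [unit.strip() for unit in text.split(",")]
            elif text.startswith("Box"):
                result["box"] = text[len("Box"):].lstrip("# ").strip()
            elif text.lower().startswith("xstr"):
                result["cross_streets"] = text[len("xstr"):].strip()
            else:
                result["extras"].append(text)  # e.g. "Alarm Type MANUAL/SMOKE"
        elif result["job"] is None:
            # First plain-text section: job code, possibly followed by the address.
            job, _, rest = text.partition(" ")
            result["job"] = job
            if rest:
                result["address"] = rest.rstrip(". ")
        elif result["cross_streets"] is None:
            result["address"] = text.rstrip(". ")
        else:
            # Trailing text: optional description, then the job number after '#'.
            details, separator, number = text.rpartition("#")
            if separator:
                result["job_number"] = "#" + number.strip()
            else:
                details = text
            result["details"] = details.strip(". ") or None
    return result


page = parse_page('(UNIT1, UNIT2, UNIT3) 911-STRU (Box# 12345) aBusiness 12345 Street aTown '
                  '(Xstr CrossStreet1/CrossStreet2) building fire, persons reported #F123456')
print(page["box"], page["address"], page["job_number"])
```

Because bracketed sections are recognised by their content rather than their position, a missing Box section or an extra Alarm Type section does not shift the other fields.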