I have a dataset that looks like the following:
Capture MICR - Serial: Pos44: Trrt: 32904 Acct: Tc: 2064 Opt4: Split:
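The problem I'm having is that I can't figure out how to correctly write the capture for the "Capture MICR - Serial" field. This field can be blank, or it can contain an alphanumeric value of varying length. (I have the same problem with other fields that may be either filled or blank.)
I have tried several variations of the following, but I keep coming up short.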
pp.Literal("Capture MICR - Serial:") + pp.White(" ", min=1, max=0) + (pp.Word(pp.printables) ^ pp.White(" ", min=1, max=0))("crd_micr_serial") + pp.FollowedBy(pp.Literal("Pos44:"))
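I think part of the problem is that Or matches the longest parse, which in this case could be a long run of whitespace followed by only a single alphanumeric value, but I still want to capture that single value.
Thanks for your help.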
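For reference, the difference between pyparsing's two alternation operators can be seen in a minimal sketch. The expressions `short` and `long_` are made up for illustration and have nothing to do with the dataset above; `^` builds an `Or` (the longest match wins), while `|` builds a `MatchFirst` (the first alternative that matches wins).

```python
import pyparsing as pp

# Two hypothetical alternatives where one is a prefix of the other.
short = pp.Literal("ab")
long_ = pp.Literal("abcd")

# Or (^) tries every alternative and keeps the longest match.
longest = (short ^ long_).parseString("abcd")[0]

# MatchFirst (|) stops at the first alternative that matches.
first = (short | long_).parseString("abcd")[0]

print(longest)  # abcd
print(first)    # ab
```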
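Answer 0 (score: 1)
Does this do what you want?
I use Combine only so that both arms of the Or produce similar results, namely a string with 'Pos44:' at the end, from which the value can be pulled out. I'm not happy about resorting to a regular expression.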
>>> import pyparsing as pp
>>> record_A = 'Capture MICR - Serial: Pos44: Trrt: 32904 Acct: Tc: 2064 Opt4: Split:'
>>> record_B = 'Capture MICR - Serial: 76ZXP67 Pos44: Trrt: 32904 Acct: Tc: 2064 Opt4: Split:'
>>> parser_fragment = pp.Combine(pp.White()+pp.Literal('Pos44:'))
>>> parser = pp.Literal('Capture MICR - Serial:')+pp.Or([parser_fragment,pp.Regex(r'.*?(?:Pos44:)')])
>>> parser.parseString(record_A)
(['Capture MICR - Serial:', ' Pos44:'], {})
>>> parser.parseString(record_B)
(['Capture MICR - Serial:', '76ZXP67 Pos44:'], {})
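Pulling the value out of that second token is plain string handling. The following is only a sketch: `strip_marker` is a hypothetical helper name, not part of pyparsing, and the two sample tokens are copied from the results shown above.

```python
MARKER = 'Pos44:'

def strip_marker(token, marker=MARKER):
    """Return the field value from a token that ends with the marker."""
    if not token.endswith(marker):
        raise ValueError('token does not end with %r' % marker)
    # Drop the trailing marker, then any padding around the value.
    return token[:-len(marker)].strip()

# Second tokens copied from the two parse results shown above.
print(repr(strip_marker(' Pos44:')))         # ''
print(repr(strip_marker('76ZXP67 Pos44:')))  # '76ZXP67'
```

A blank field comes back as an empty string, so the caller can treat both cases the same way.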
Answer 1 (score: 1)
The simplest way to parse text like "A: valueA B: valueB C: valueC" is to use pyparsing's SkipTo class:
from pyparsing import SkipTo, LineEnd
a_expr = "A:" + SkipTo("B:")
b_expr = "B:" + SkipTo("C:")
c_expr = "C:" + SkipTo(LineEnd())
line_parser = a_expr + b_expr + c_expr
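I'd like to take this a step further:
add a parse action to strip leading and trailing whitespace
add results names, so that the values are easy to get at after a line has been parsed
Here is what that simple parser looks like: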
NL = LineEnd()
a_expr = "A:" + SkipTo("B:").addParseAction(lambda t: [t[0].strip()])('A')
b_expr = "B:" + SkipTo("C:").addParseAction(lambda t: [t[0].strip()])('B')
c_expr = "C:" + SkipTo(NL).addParseAction(lambda t: [t[0].strip()])('C')
line_parser = a_expr + b_expr + c_expr
line_parser.runTests("""
A: 100 B: Fred C:
A: B: a value with spaces C: 42
""")
Gives:
A: 100 B: Fred C:
['A:', '100', 'B:', 'Fred', 'C:', '']
- A: '100'
- B: 'Fred'
- C: ''
A: B: a value with spaces C: 42
['A:', '', 'B:', 'a value with spaces', 'C:', '42']
- A: ''
- B: 'a value with spaces'
- C: '42'
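Outside of runTests, the results names make the individual values easy to read from a parsed line. This is a minimal sketch that rebuilds the same three-field parser; the helper name `stripped` is my own, and the camelCase method names follow the style used in this answer (newer pyparsing releases also offer snake_case equivalents).

```python
import pyparsing as pp

def stripped(tokens):
    # Parse action: replace the skipped text with a whitespace-trimmed copy.
    return [tokens[0].strip()]

a_expr = "A:" + pp.SkipTo("B:").addParseAction(stripped)("A")
b_expr = "B:" + pp.SkipTo("C:").addParseAction(stripped)("B")
c_expr = "C:" + pp.SkipTo(pp.LineEnd()).addParseAction(stripped)("C")
line_parser = a_expr + b_expr + c_expr

result = line_parser.parseString("A: 100 B: Fred C:")

print(result["A"])      # access by key
print(result.B)         # or as an attribute
print(result.asDict())  # or all named fields at once
```

Blank fields show up as empty strings rather than as missing keys, so downstream code does not have to special-case them.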
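I try to avoid copy/pasting code, so I'd rather automate the "A is followed by B" and "C is followed by end of line" relationships by using a list that describes the different prompt strings, then walking that list to build each subexpression: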
import pyparsing as pp

def make_prompt_expr(s):
    '''Define the expression for prompts as 'ABC:' '''
    return pp.Combine(pp.Literal(s) + ':')

def make_field_value_expr(next_expr):
    '''Define the expression for the field value as SkipTo(what comes next)'''
    return pp.SkipTo(next_expr).addParseAction(lambda t: [t[0].strip()])

def make_name(s):
    '''Convert prompt string to identifier form for results names'''
    return ''.join(s.split()).replace('-','_')

# use split to easily define list of prompts in order - makes it easy to update later if new prompts are added
prompts = "Capture MICR - Serial/Pos44/Trrt/Acct/Tc/Opt4/Split".split('/')

# keep a list of all the prompt-value expressions
exprs = []

# get a list of this-prompt, next-prompt pairs
for this_, next_ in zip(prompts, prompts[1:] + [None]):
    field_name = make_name(this_)
    if next_ is not None:
        next_expr = make_prompt_expr(next_)
    else:
        next_expr = pp.LineEnd()

    # define the prompt-value expression for the current prompt string and add to exprs
    this_expr = make_prompt_expr(this_) + make_field_value_expr(next_expr)(field_name)
    exprs.append(this_expr)

# define a line parser as the And of all of the generated exprs
line_parser = pp.And(exprs)

line_parser.runTests("""\
Capture MICR - Serial: Pos44: Trrt: 32904 Acct: Tc: 2064 Opt4: Split:
Capture MICR - Serial: 1729XYZ Pos44: Trrt: 32904 Acct: Tc: 2064 Opt4: XXL Split: 50
""")
Gives:
Capture MICR - Serial: Pos44: Trrt: 32904 Acct: Tc: 2064 Opt4: Split:
['Capture MICR - Serial:', '', 'Pos44:', '', 'Trrt:', '32904', 'Acct:', '', 'Tc:', '2064', 'Opt4:', '', 'Split:', '']
- Acct: ''
- CaptureMICR_Serial: ''
- Opt4: ''
- Pos44: ''
- Split: ''
- Tc: '2064'
- Trrt: '32904'
Capture MICR - Serial: 1729XYZ Pos44: Trrt: 32904 Acct: Tc: 2064 Opt4: XXL Split: 50
['Capture MICR - Serial:', '1729XYZ', 'Pos44:', '', 'Trrt:', '32904', 'Acct:', '', 'Tc:', '2064', 'Opt4:', 'XXL', 'Split:', '50']
- Acct: ''
- CaptureMICR_Serial: '1729XYZ'
- Opt4: 'XXL'
- Pos44: ''
- Split: '50'
- Tc: '2064'
- Trrt: '32904'
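The pairing step in that loop is the part that replaces the copy/paste, so it may help to look at it in isolation. The sketch below uses no pyparsing at all; it only prints which terminator each field would get, and the `pairs` and `terminator` names are mine, added for illustration.

```python
def make_name(s):
    # Same conversion as in the answer: drop whitespace, '-' becomes '_'.
    return ''.join(s.split()).replace('-', '_')

prompts = "Capture MICR - Serial/Pos44/Trrt/Acct/Tc/Opt4/Split".split('/')

# Pair every prompt with the one after it; the last prompt is paired with
# None, which the answer's loop turns into "end of line".
pairs = list(zip(prompts, prompts[1:] + [None]))

for this_, next_ in pairs:
    terminator = repr(next_ + ':') if next_ is not None else 'end of line'
    print('%-20s value runs until %s' % (make_name(this_), terminator))
```

Adding a new prompt then only means adding it to the `prompts` string in the right position.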