Finding the right approach for a regular expression whose groups can appear in different orders

Time: 2017-10-15 01:48:04

Tags: python regex

I am trying to parse a number of COBOL copybooks with Python.

I have this regular expression, which I modified from the one provided in cobol.py:

^(?P<level>\d{2})\s+(?P<name>\S+).*?
(\s+INDEXED BY\s+(?P<indexed_by>\S+))?.*?
(\s+REDEFINES\s+(?P<redefines>\S+))?.*?
(\s+PIC(TURE)?\s+(?P<pic>\S+))?.*?
(\s+OCCURS\s+(?P<occurs>\d+).?( TIMES)?)?.*?
((?P<comp>)\s+COMP\S+)?.*?
(\s+VALUE\s+(?P<value>\S+).*)?
\.$

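Below is a sample of the text. The regex works for every line except the second-to-last, where it fails to find a match for the pic group because the occurs group has already (ahem) occurred earlier in the string.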

03  AMOUNT-BREAKDOWN        PICTURE 9(8)V99  VALUE ZEROES.
03  AMOUNT-BREAKDOWN-X REDEFINES AMOUNT-BREAKDOWN.
05  FILLER              PICTURE X(3)     VALUE "DEC".
03  MONTH REDEFINES MONTH-TAB  PICTURE X(3) OCCURS 12 TIMES.
03  SUB                 PICTURE 99    VALUE 0.
03  NUMBER-HOLD.
05  NUMB-HOLD       PICTURE X  OCCURS 11 TIMES.
05  FILLER              PICTURE X(5)     VALUE "TEN".
03  DIGIT-TAB2 REDEFINES DIGIT-TAB1.
05  DIGIT-TABLE         OCCURS 10   PICTURE X(5).
03  WK-TEN-MILLION          PICTURE X(5)     VALUE SPACES.

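I struggle with regular expressions at the best of times, but I suspect I am at risk of tying myself in knots because I am missing something fundamental.

To be clear: every line with a PICTURE clause is captured by the pic group except the second-to-last line, where PICTURE comes after the OCCURS clause that the occurs group captures.

Any help is appreciated.

3 Answers: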

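Answer 0: (score: 1)

PyParsing (https://github.com/pyparsing/pyparsing) is a good module for building grammars easily. You can build a basic copybook grammar and then parse the copybook with PyParsing. After that, you will have to post-process the result to preserve the tree structure indicated by the two-digit level fields.

Also take a look at the Copybook package (https://github.com/zalmane/copybook), which uses PyParsing.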
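
The post-processing step mentioned above can be sketched in plain Python. This is only an illustration of the nesting logic, not code from PyParsing or the Copybook package: `ITEM` and `build_tree` are hypothetical names, only the level and name are extracted, and special levels such as 66, 77 and 88 are not treated differently.

```python
import re

# Hypothetical pattern: pulls out just the two-digit level and the item name.
ITEM = re.compile(r"^\s*(?P<level>\d{2})\s+(?P<name>[^\s.]+)")

def build_tree(lines):
    """Nest copybook items by level number using a stack of open groups."""
    root = {"level": 0, "name": "ROOT", "children": []}
    stack = [root]
    for line in lines:
        m = ITEM.match(line)
        if not m:
            continue  # skip blank lines, comments, continuation lines
        node = {"level": int(m.group("level")),
                "name": m.group("name"),
                "children": []}
        # Close every open item whose level is not lower than the new one;
        # the synthetic root is never popped.
        while len(stack) > 1 and stack[-1]["level"] >= node["level"]:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root

sample = [
    "03  NUMBER-HOLD.",
    "05  NUMB-HOLD       PICTURE X  OCCURS 11 TIMES.",
    '05  FILLER              PICTURE X(5)     VALUE "TEN".',
    "03  DIGIT-TAB2 REDEFINES DIGIT-TAB1.",
    "05  DIGIT-TABLE         OCCURS 10   PICTURE X(5).",
]
tree = build_tree(sample)
for group in tree["children"]:
    print(group["name"], [child["name"] for child in group["children"]])
```

A stack works here because a copybook lists items in depth-first order: an item belongs to the nearest preceding item with a lower level number.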

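Answer 1: (score: 0)

Although an actual parser such as PLY or parsely would be best suited to this, if you must use a regex, could you not simply add another OCCURS group with a different key? For example: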

"""
03  AMOUNT-BREAKDOWN        PICTURE 9(8)V99  VALUE ZEROES.
03  AMOUNT-BREAKDOWN-X REDEFINES AMOUNT-BREAKDOWN.
05  FILLER              PICTURE X(3)     VALUE "DEC".
03  MONTH REDEFINES MONTH-TAB  PICTURE X(3) OCCURS 12 TIMES.
03  SUB                 PICTURE 99    VALUE 0.
03  NUMBER-HOLD.
05  NUMB-HOLD       PICTURE X  OCCURS 11 TIMES.
05  FILLER              PICTURE X(5)     VALUE "TEN".
03  DIGIT-TAB2 REDEFINES DIGIT-TAB1.
05  DIGIT-TABLE         OCCURS 10   PICTURE X(5).
03  WK-TEN-MILLION          PICTURE X(5)     VALUE SPACES.
"""
import re

# The module docstring above (__doc__) holds the sample copybook lines.
for line in __doc__.split("\n"):
    if not line:
        continue
    m = re.match(
        r"^(?P<level>\d{2})\s+(?P<name>\S+).*?"
        r"(\s+INDEXED BY\s+(?P<indexed_by>\S+))?.*?"
        r"(\s+REDEFINES\s+(?P<redefines>\S+))?.*?"
        r"(\s+OCCURS\s+(?P<occurs1>\d+).?( TIMES)?)?.*?"   # <-- occurs1: OCCURS before PICTURE
        r"(\s+PIC(TURE)?\s+(?P<pic>\S+))?.*?"
        r"(\s+OCCURS\s+(?P<occurs>\d+).?( TIMES)?)?.*?"    # OCCURS after PICTURE
        r"(\s+(?P<comp>COMP\S+))?.*?"                      # comp captures the COMP... token itself
        r"(\s+VALUE\s+(?P<value>\S+).*)?"
        r"\.$", line)
    if m:
        print(m.groups())


Sample output:

('03', 'AMOUNT-BREAKDOWN', None, None, None, None, None, None, None, '        PICTURE 9(8)V99', 'TURE', '9(8)V99', None, None, None, None, None, '  VALUE ZEROES', 'ZEROES')
('03', 'AMOUNT-BREAKDOWN-X', None, None, ' REDEFINES AMOUNT-BREAKDOWN', 'AMOUNT-BREAKDOWN', None, None, None, None, None, None, None, None, None, None, None, None, None)
('05', 'FILLER', None, None, None, None, None, None, None, '              PICTURE X(3)', 'TURE', 'X(3)', None, None, None, None, None, '     VALUE "DEC"', '"DEC"')
('03', 'MONTH', None, None, ' REDEFINES MONTH-TAB', 'MONTH-TAB', None, None, None, '  PICTURE X(3)', 'TURE', 'X(3)', ' OCCURS 12 ', '12', None, None, None, None, None)
('03', 'SUB', None, None, None, None, None, None, None, '                 PICTURE 99', 'TURE', '99', None, None, None, None, None, '    VALUE 0', '0')
('03', 'NUMBER-HOLD', None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None)
('05', 'NUMB-HOLD', None, None, None, None, None, None, None, '       PICTURE X', 'TURE', 'X', '  OCCURS 11 ', '11', None, None, None, None, None)
('05', 'FILLER', None, None, None, None, None, None, None, '              PICTURE X(5)', 'TURE', 'X(5)', None, None, None, None, None, '     VALUE "TEN"', '"TEN"')
('03', 'DIGIT-TAB2', None, None, ' REDEFINES DIGIT-TAB1', 'DIGIT-TAB1', None, None, None, None, None, None, None, None, None, None, None, None, None)
('05', 'DIGIT-TABLE', None, None, None, None, '         OCCURS 10 ', '10', None, '  PICTURE X(5)', 'TURE', 'X(5)', None, None, None, None, None, None, None)
('03', 'WK-TEN-MILLION', None, None, None, None, None, None, None, '          PICTURE X(5)', 'TURE', 'X(5)', None, None, None, None, None, '     VALUE SPACES', 'SPACES')
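
If duplicating groups becomes unwieldy as more clause orders turn up, one alternative (not part of the answer above, just a sketch) is to match the level and name once and then search for each clause independently, so the order of the clauses no longer matters. The names `HEAD`, `CLAUSES` and `parse_line` are made up for this illustration. Like the original regex, it captures only the first token of a VALUE literal, and a keyword that appears inside a literal or a data name could produce a false match.

```python
import re

# Level and name always come first, so they are matched positionally.
HEAD = re.compile(r"^\s*(?P<level>\d{2})\s+(?P<name>[^\s.]+)")

# Every other clause is searched for on its own, in any order.
CLAUSES = {
    "indexed_by": re.compile(r"\sINDEXED\s+BY\s+(\S+)"),
    "redefines": re.compile(r"\sREDEFINES\s+(\S+)"),
    "pic": re.compile(r"\sPIC(?:TURE)?\s+(\S+)"),
    "occurs": re.compile(r"\sOCCURS\s+(\d+)"),
    "comp": re.compile(r"\s(COMP\S*)"),
    "value": re.compile(r"\sVALUE\s+(\S+)"),
}

def parse_line(line):
    """Return a dict of clause values for one copybook line, or None."""
    body = line.rstrip()
    if body.endswith("."):
        body = body[:-1]  # drop the terminating period only
    head = HEAD.match(body)
    if head is None:
        return None
    fields = head.groupdict()
    rest = body[head.end():]  # search after the name to avoid matching it
    for key, pattern in CLAUSES.items():
        found = pattern.search(rest)
        fields[key] = found.group(1) if found else None
    return fields

row = parse_line("05  DIGIT-TABLE         OCCURS 10   PICTURE X(5).")
print(row["pic"], row["occurs"])  # X(5) 10
```

This trades one large pattern for several small ones, which also makes it easier to see which clause failed when a line does not parse as expected.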

Answer 2: (score: 0)

cb2xml

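You should take a look at cb2xml. It parses a COBOL copybook and creates an XML file. You can then process the XML in Python or any other language. The cb2xml package includes basic examples of processing the XML in Python and other languages.

COBOL: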

   01 Ams-Vendor.
       03 Brand               Pic x(3).
       03 Location-details.
          05 Location-Number  Pic 9(4).
          05 Location-Type    Pic XX.
          05 Location-Name    Pic X(35).
       03 Address-Details.
          05 actual-address.
             10 Address-1     Pic X(40).
             10 Address-2     Pic X(40).
             10 Address-3     Pic X(35).
          05 Postcode         Pic 9(4).
          05 Empty            pic x(6).
          05 State            Pic XXX.
       03 Location-Active     Pic X.

Output from cb2xml:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<copybook filename="cbl2xml_Test110.cbl">
    <item display-length="173" level="01" name="Ams-Vendor" position="1" storage-length="173">
        <item display-length="3" level="03" name="Brand" picture="x(3)" position="1" storage-length="3"/>
        <item display-length="41" level="03" name="Location-details" position="4" storage-length="41">
            <item display-length="4" level="05" name="Location-Number" numeric="true" picture="9(4)" position="4" storage-length="4"/>
            <item display-length="2" level="05" name="Location-Type" picture="XX" position="8" storage-length="2"/>
            <item display-length="35" level="05" name="Location-Name" picture="X(35)" position="10" storage-length="35"/>
        </item>
        <item display-length="128" level="03" name="Address-Details" position="45" storage-length="128">
            <item display-length="115" level="05" name="actual-address" position="45" storage-length="115">
                <item display-length="40" level="10" name="Address-1" picture="X(40)" position="45" storage-length="40"/>
                <item display-length="40" level="10" name="Address-2" picture="X(40)" position="85" storage-length="40"/>
                <item display-length="35" level="10" name="Address-3" picture="X(35)" position="125" storage-length="35"/>
            </item>
            <item display-length="4" level="05" name="Postcode" numeric="true" picture="9(4)" position="160" storage-length="4"/>
            <item display-length="6" level="05" name="Empty" picture="x(6)" position="164" storage-length="6"/>
            <item display-length="3" level="05" name="State" picture="XXX" position="170" storage-length="3"/>
        </item>
        <item display-length="1" level="03" name="Location-Active" picture="X" position="173" storage-length="1"/>
    </item>
</copybook>                
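
As a rough illustration of processing this XML in Python, the sketch below uses only the standard library's xml.etree.ElementTree. It is not one of the examples shipped with cb2xml. The XML is an abbreviated copy of the output above, the record string is made up, and the assumption that position is 1-based is inferred from the first item having position="1".

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the cb2xml output (XML declaration omitted, some items
# dropped, so the group lengths no longer add up).
XML = """<copybook filename="cbl2xml_Test110.cbl">
    <item display-length="173" level="01" name="Ams-Vendor" position="1" storage-length="173">
        <item display-length="3" level="03" name="Brand" picture="x(3)" position="1" storage-length="3"/>
        <item display-length="41" level="03" name="Location-details" position="4" storage-length="41">
            <item display-length="4" level="05" name="Location-Number" numeric="true" picture="9(4)" position="4" storage-length="4"/>
            <item display-length="2" level="05" name="Location-Type" picture="XX" position="8" storage-length="2"/>
        </item>
    </item>
</copybook>"""

def elementary_fields(node):
    """Yield (name, start_offset, length) for items that carry a picture."""
    for item in node.iter("item"):
        if "picture" in item.attrib:
            # Assumption: position is 1-based, so subtract 1 for slicing.
            start = int(item.get("position")) - 1
            yield item.get("name"), start, int(item.get("storage-length"))

root = ET.fromstring(XML)
fields = list(elementary_fields(root))

# Made-up text record, sliced using the offsets from the XML.
record = "ABC1234XY"
values = {name: record[start:start + length] for name, start, length in fields}
print(values)
```

Slicing like this is only meaningful for text (display) fields; binary or packed fields would need decoding according to their usage.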

An interesting application of cb2xml is described in Dynamically Reading COBOL Redefines with C#.

CobolToCsv

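The CobolToCsv package converts a COBOL data file to a CSV file. Limitations:

  • It does not handle redefines or multi-record files.
  • It supports only a fairly limited range of COBOL compilers (Mainframe, GNU COBOL, Fujitsu COBOL).

Cobol2Csv should be able to handle text files (+ Comp-3). It may be able to handle some of your files.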