Regular expression for parsing COBOL code

Time: 2013-12-10 17:33:07

Tags: c# regex split cobol

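I am currently writing a tool that analyzes COBOL code. For that I need a regular expression that separates a line into individual words, and I am terrible at regular expressions.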

I found that the following works for most cases, but not all of them.

string[] words = Regex.Split(line, @"[^\p{L}]*\p{Z}[^\p{L}]*");

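The problem is that it takes a field like ARG-1 and returns only ARG. It also does not split something like MY-TABLE(WS-INDEX) into MY-TABLE and WS-INDEX. I would greatly appreciate anyone who can point me in the right direction.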
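One way to see why the pattern loses the "-1" is to run it on a sample line. The sketch below is in Python rather than C#, and because Python's re module does not support \p{L} or \p{Z}, it substitutes ASCII approximations ([A-Za-z] for letters, \s for separators). That substitution and the function name are assumptions made only for illustration.

```python
import re

# ASCII stand-in for the question's pattern [^\p{L}]*\p{Z}[^\p{L}]*
# (assumption: \p{L} ~ [A-Za-z], \p{Z} ~ \s).
SPLIT_PATTERN = re.compile(r"[^A-Za-z]*\s[^A-Za-z]*")

def split_like_question(line):
    """Split a line roughly the way the question's pattern does."""
    return SPLIT_PATTERN.split(line)

# The non-letters "-1" in front of the space are swallowed by the separator.
print(split_like_question("MOVE ARG-1 TO WS-X"))
# There is no whitespace inside the subscripted name, so nothing is split.
print(split_like_question("MY-TABLE(WS-INDEX)"))
```

The separator is allowed to absorb any run of non-letters next to a space, which is why the digit suffix disappears while WS-X, which has no space after the hyphen, survives intact.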

Update

Thanks for all the insight. I ended up with something that does what I want:

string[] words = Regex.Split(line, @"\s+");

I then use the Contains() method to check the individual words further and see whether any of them contains a table item, for example:

MY-TEST-TABLE(WS-INDEX)

If one does, I substring it to get the 2 pieces.
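The two steps described in the update could look roughly like this. It is a Python sketch, not the C# code actually used: re.split stands in for Regex.Split, the "in" test stands in for Contains(), and the function name split_words is hypothetical.

```python
import re

def split_words(line):
    """Split on whitespace, then break NAME(INDEX) into NAME and INDEX."""
    words = []
    for word in re.split(r"\s+", line.strip()):
        if not word:
            continue  # an empty or blank line yields one empty string
        if "(" in word and word.endswith(")"):
            # Table reference: keep the table name and the subscript.
            name, _, rest = word.partition("(")
            words.extend([name, rest[:-1]])
        else:
            words.append(word)
    return words

print(split_words("MOVE ARG-1 TO MY-TEST-TABLE(WS-INDEX)"))
```

This only handles a single, unnested subscript with no trailing period; anything beyond that needs more logic than a whitespace split.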

Thanks, everyone.

2 answers:

Answer 0: (score: 2)

Jeff A;

Take a look at the GNU Cobol parsing code at http://sourceforge.net/p/open-cobol/code/HEAD/tree/trunk/gnu-cobol/cobc, starting with http://sourceforge.net/p/open-cobol/code/HEAD/tree/trunk/gnu-cobol/cobc/scanner.l

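That directory, and the .l lexer file in particular, contains some regular expressions, but they only work in combination with the .y bison file, which supplies the context of a formal grammar.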

Or, for quicker feedback, try the Koopa Cobol Parser at http://koopa.sourceforge.net/, which ships as a .jar.

Or, for a poor man's version, there is the Pygments COBOL syntax highlighter, in pygments/lexers/compiled.py on bitbucket.org:

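# Imports this excerpt relies on (in Pygments they sit at the top of the module).
import re

from pygments.lexer import RegexLexer, include
from pygments.token import (Text, Comment, Operator, Keyword, Name, String,
                            Number, Punctuation)


class CobolLexer(RegexLexer):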
    """
    Lexer for GNU Cobol code.

    *New in Pygments 1.6.*
    """
    name = 'COBOL'
    aliases = ['cobol']
    filenames = ['*.cob', '*.COB', '*.cpy', '*.CPY']
    mimetypes = ['text/x-cobol']
    flags = re.IGNORECASE | re.MULTILINE

    # Data Types: by PICTURE and USAGE
    # Operators: **, *, +, -, /, <, >, <=, >=, =, <>
    # Logical (?): NOT, AND, OR

    # Reserved words:
    # http://opencobol.add1tocobol.com/gnucobol/#reserved-words
    # Intrinsics:
    # http://opencobol.add1tocobol.com/gnucobol/#does-gnu-cobol-implement-any-intrinsic-functions

    tokens = {
        'root': [
            include('comment'),
            include('strings'),
            include('core'),
            include('nums'),
            (r'[a-z0-9]([_a-z0-9\-]*[a-z0-9]+)?', Name.Variable),
    #       (r'[\s]+', Text),
            (r'[ \t]+', Text),
        ],
        'comment': [
            (r'(^.{6}[*/].*\n|^.{6}|\*>.*\n)', Comment),
        ],
        'core': [
            # Figurative constants
            #(r'(^|(?<=[^0-9a-z_\-]))(ALL\s+)?'
            (r'\b(?!-)(ALL\s+)?'
             r'((ZEROES)|(HIGH-VALUE|LOW-VALUE|NULL|QUOTE|SPACE|ZERO)(S)?)'
             r'\b(?!-)',
             #r'\s*($|(?=[^0-9a-z_\-]))',
             Name.Constant),

            # Reserved words STATEMENTS and other bolds
            #(r'(^|(?<=[^0-9a-z_\-]))'
             (r'\b(?!-)'
             r'(ACCEPT|ADD|ALLOCATE|CALL|CANCEL|CLOSE|COMPUTE|'
             r'CONFIGURATION|CONTINUE|'
             r'DATA|DELETE|DISPLAY|DIVIDE|DIVISION|ELSE|END|END-ACCEPT|'
             r'END-ADD|END-CALL|END-COMPUTE|END-DELETE|END-DISPLAY|'
             r'END-DIVIDE|END-EVALUATE|END-IF|END-MULTIPLY|END-OF-PAGE|'
             r'END-PERFORM|END-READ|END-RETURN|END-REWRITE|END-SEARCH|'
             r'END-START|END-STRING|END-SUBTRACT|END-UNSTRING|END-WRITE|'
             r'ENVIRONMENT|EVALUATE|EXIT|FD|FILE|FILE-CONTROL|FOREVER|'
             r'FREE|FUNCTION-ID|GENERATE|GO|GOBACK|'
             r'IDENTIFICATION|IF|INITIALIZE|'
             r'INITIATE|INPUT-OUTPUT|INSPECT|INVOKE|I-O-CONTROL|LINKAGE|'
             r'LOCAL-STORAGE|MERGE|MOVE|MULTIPLY|OPEN|'
             r'PERFORM|PROCEDURE|PROGRAM-ID|RAISE|READ|RELEASE|RESUME|'
             r'RETURN|REWRITE|SCREEN|'
             r'SD|SEARCH|SECTION|SET|SORT|START|STOP|STRING|SUBTRACT|'
             r'SUPPRESS|TERMINATE|THEN|UNLOCK|UNSTRING|USE|VALIDATE|'
             r'WORKING-STORAGE|WRITE)'
             r'\b(?!-)', Keyword.Reserved),
             #r'\s*($|(?=[^0-9a-z_\-]))', Keyword.Reserved),

            # Reserved words
            #(r'(^|(?<=[^0-9a-z_\-]))'
            (r'\b(?!-)'
             r'(ACCESS|ADDRESS|ADVANCING|AFTER|ALL|'
             r'ALPHABET|ALPHABETIC|ALPHABETIC-LOWER|ALPHABETIC-UPPER|'
             r'ALPHANUMERIC|ALPHANUMERIC-EDITED|ALSO|ALTER|ALTERNATE|'
             r'ANY|ARE|AREA|AREAS|ARGUMENT-NUMBER|ARGUMENT-VALUE|AS|'
             r'ASCENDING|ASSIGN|AT|AUTO|AUTO-SKIP|AUTOMATIC|AUTOTERMINATE|'
             r'BACKGROUND-COLOR|BASED|BEEP|BEFORE|BELL|'
             r'BLANK|'
             r'BLINK|BLOCK|BOTTOM|BY|BYTE-LENGTH|CHAINING|'
             r'CHARACTER|CHARACTERS|CLASS|CODE|CODE-SET|COL|COLLATING|'
             r'COLS|COLUMN|COLUMNS|COMMA|COMMAND-LINE|COMMIT|COMMON|'
             r'CONSTANT|CONTAINS|CONTENT|CONTROL|'
             r'CONTROLS|CONVERTING|COPY|CORR|CORRESPONDING|COUNT|CRT|'
             r'CURRENCY|CURSOR|CYCLE|DATE|DAY|DAY-OF-WEEK|DE|DEBUGGING|'
             r'DECIMAL-POINT|DECLARATIVES|DEFAULT|DELIMITED|'
             r'DELIMITER|DEPENDING|DESCENDING|DETAIL|DISK|'
             r'DOWN|DUPLICATES|DYNAMIC|EBCDIC|'
             r'ENTRY|ENVIRONMENT-NAME|ENVIRONMENT-VALUE|EOL|EOP|'
             r'EOS|ERASE|ERROR|ESCAPE|EXCEPTION|'
             r'EXCLUSIVE|EXTEND|EXTERNAL|'
             r'FILE-ID|FILLER|FINAL|FIRST|FIXED|'
             r'FOOTING|FOR|FOREGROUND-COLOR|FORMAT|FROM|FULL|FUNCTION|'
             r'GIVING|GLOBAL|GROUP|'
             r'HEADING|HIGHLIGHT|I-O|ID|'
             r'IGNORE|IGNORING|IN|INDEX|INDEXED|INDICATE|'
             r'INITIAL|INITIALIZED|INPUT|'
             r'INTO|INTRINSIC|INVALID|IS|JUST|JUSTIFIED|'
             r'KEY|KEYBOARD|LABEL|'
             r'LAST|LEADING|LEFT|LENGTH|LIMIT|LIMITS|LINAGE|'
             r'LINAGE-COUNTER|LINE|LINES|LOCALE|LOCK|'
             r'LOWLIGHT|MANUAL|MEMORY|MINUS|MODE|'
             r'MULTIPLE|NATIONAL|NATIONAL-EDITED|NATIVE|'
             r'NEGATIVE|NEXT|NO|NUMBER|NUMBERS|NUMERIC|'
             r'NUMERIC-EDITED|OBJECT-COMPUTER|OCCURS|OF|OFF|OMITTED|ON|ONLY|'
             r'OPTIONAL|ORDER|ORGANIZATION|OTHER|OUTPUT|OVERFLOW|'
             r'OVERLINE|PACKED-DECIMAL|PADDING|PAGE|PARAGRAPH|'
             r'PLUS|POSITION|POSITIVE|PRESENT|PREVIOUS|'
             r'PRINTER|PRINTING|PROCEDURES|'
             r'PROCEED|PROGRAM|PROMPT|QUOTE|'
             r'QUOTES|RANDOM|RD|RECORD|RECORDING|RECORDS|RECURSIVE|'
             r'REDEFINES|REEL|REFERENCE|RELATIVE|REMAINDER|REMOVAL|'
             r'RENAMES|REPLACING|REPORT|REPORTING|REPORTS|REPOSITORY|'
             r'REQUIRED|RESERVE|RETURNING|REVERSE-VIDEO|REWIND|'
             r'RIGHT|ROLLBACK|ROUNDED|RUN|SAME|SCROLL|'
             r'SECURE|SEGMENT-LIMIT|SELECT|SENTENCE|SEPARATE|'
             r'SEQUENCE|SEQUENTIAL|SHARING|SIGN|SIGNED|SIGNED-INT|'
             r'SIGNED-LONG|SIGNED-SHORT|SIZE|SORT-MERGE|SOURCE|'
             r'SOURCE-COMPUTER|SPECIAL-NAMES|STANDARD|'
             r'STANDARD-1|STANDARD-2|STATUS|SUM|'
             r'SYMBOLIC|SYNC|SYNCHRONIZED|TALLYING|TAPE|'
             r'TEST|THROUGH|THRU|TIME|TIMES|TO|TOP|TRAILING|'
             r'TRANSFORM|TYPE|UNDERLINE|UNIT|UNSIGNED|'
             r'UNSIGNED-INT|UNSIGNED-LONG|UNSIGNED-SHORT|UNTIL|UP|'
             r'UPDATE|UPON|USAGE|USING|VALUE|VALUES|VARYING|WAIT|WHEN|'
             r'WITH|WORDS|YYYYDDD|YYYYMMDD)'
             r'\b(?!-)', Keyword.Pseudo),
             #r'\s*($|(?=[^0-9a-z_\-]))', Keyword.Pseudo),

            # inactive reserved words
            #(r'(^|(?<=[^0-9a-z_\-]))'
            (r'\b(?!-)'
             r'(ACTIVE-CLASS|ALIGNED|ANYCASE|ARITHMETIC|ATTRIBUTE|B-AND|'
             r'B-NOT|B-OR|B-XOR|BIT|BOOLEAN|CD|CENTER|CF|CH|CHAIN|CLASS-ID|'
             r'CLASSIFICATION|COMMUNICATION|CONDITION|DATA-POINTER|'
             r'DESTINATION|DISABLE|EC|EGI|EMI|ENABLE|END-RECEIVE|'
             r'ENTRY-CONVENTION|EO|ESI|EXCEPTION-OBJECT|EXPANDS|FACTORY|'
             r'FLOAT-BINARY-16|FLOAT-BINARY-34|FLOAT-BINARY-7|'
             r'FORMAT|'
             r'GET|GROUP-USAGE|IMPLEMENTS|INFINITY|'
             r'INHERITS|INTERFACE|INTERFACE-ID|INVOKE|LC_ALL|LC_COLLATE|'
             r'LC_CTYPE|LC_MESSAGES|LC_MONETARY|LC_NUMERIC|LC_TIME|'
             r'LINE-COUNTER|MESSAGE|METHOD|METHOD-ID|NESTED|NONE|NORMAL|'
             r'OBJECT|OBJECT-REFERENCE|OPTIONS|OVERRIDE|PAGE-COUNTER|PF|PH|'
             r'PROPERTY|PROTOTYPE|PURGE|QUEUE|RAISE|RAISING|RECEIVE|'
             r'RELATION|REPLACE|REPRESENTS-NOT-A-NUMBER|RESET|RESUME|RETRY|'
             r'RF|RH|SECONDS|SEGMENT|SELF|SEND|SOURCES|STATEMENT|STEP|'
             r'STRONG|SUB-QUEUE-1|SUB-QUEUE-2|SUB-QUEUE-3|SUPER|SYMBOL|'
             r'SYSTEM-DEFAULT|TABLE|TERMINAL|TEXT|TYPEDEF|UCS-4|UNIVERSAL|'
             r'USER-DEFAULT|UTF-16|UTF-8|VAL-STATUS|VALID|VALIDATE|'
             r'VALIDATE-STATUS)\b(?!-)', Comment),
             #r'VALIDATE-STATUS)\s*($|(?=[^0-9a-z_\-]))', Comment),

            # Data Types
            (r'(^|(?<=[^0-9a-z_\-]))'
            #(r'\b(?!-)'
             r'(PIC\s+.+?(?=(\s|\.\s))|PICTURE\s+.+?(?=(\s|\.\s))|'
             r'(COMPUTATIONAL)(-[1-5X])?|(COMP)(-[1-5X])?|'
             r'BINARY-C-LONG|POINTER|PROGRAM-POINTER|'
             r'FUNCTION-POINTER|PROCEDURE-POINTER|'
             r'BINARY-CHAR|BINARY-DOUBLE|BINARY-LONG|BINARY-SHORT|'
             r'FLOAT-SHORT|FLOAT-LONG|FLOAT-DECIMAL-16|FLOAT-DECIMAL-34|'
             r'FLOAT-BINARY-32|FLOAT-BINARY-64|FLOAT-BINARY-128|'
             r'FLOAT-EXTENDED|FLOAT-DECIMAL-7|'
            # r'BINARY)\b(?!-)', Keyword.Type),
             r'BINARY)\s*($|(?=[^0-9a-z_\-]))', Keyword.Type),

            # Operators
            (r'(\*\*|\*|\+|-|/|<=|>=|<|>|==|/=|=)', Operator),

            # (r'(::)', Keyword.Declaration),

            (r'([(),;:&%.])', Punctuation),

            # Intrinsics
            #(r'(^|(?<=[^0-9a-z_\-]))(ABS|ACOS|ANNUITY|ASIN|ATAN|BYTE-LENGTH|'
            (r'\b(?!-)(ABS|ACOS|ANNUITY|ASIN|ATAN|BYTE-LENGTH|'
             r'CHAR|COMBINED-DATETIME|CONCATENATE|COS|CURRENT-DATE|'
             r'DATE-OF-INTEGER|DATE-TO-YYYYMMDD|DAY-OF-INTEGER|DAY-TO-YYYYDDD|'
             r'EXCEPTION-(?:FILE|LOCATION|STATEMENT|STATUS)|EXP10|EXP|E|'
             r'FACTORIAL|FRACTION-PART|INTEGER-OF-(?:DATE|DAY|PART)|INTEGER|'
             r'LENGTH|LOCALE-(?:DATE|TIME(?:-FROM-SECONDS)?)|LOG10|LOG|'
             r'LOWER-CASE|MAX|MEAN|MEDIAN|MIDRANGE|MIN|MOD|NUMVAL(?:-C)?|'
             r'ORD(?:-MAX|-MIN)?|PI|PRESENT-VALUE|RANDOM|RANGE|REM|REVERSE|'
             r'SECONDS-FROM-FORMATTED-TIME|SECONDS-PAST-MIDNIGHT|SIGN|SIN|SQRT|'
             r'STANDARD-DEVIATION|STORED-CHAR-LENGTH|SUBSTITUTE(?:-CASE)?|'
             r'SUM|TAN|TEST-DATE-YYYYMMDD|TEST-DAY-YYYYDDD|TRIM|'
             r'UPPER-CASE|VARIANCE|WHEN-COMPILED|YEAR-TO-YYYY)'
             r'\b(?!-)', Name.Function),
             #r'UPPER-CASE|VARIANCE|WHEN-COMPILED|YEAR-TO-YYYY)\s*'
             #r'($|(?=[^0-9a-z_\-]))', Name.Function),

            # Booleans
            #(r'(^|(?<=[^0-9a-z_\-]))(true|false)\s*($|(?=[^0-9a-z_\-]))', Name.Builtin),
            (r'\b(?!-)(true|false)\b(?!-)', Name.Builtin),
            # Comparing Operators
            #(r'(^|(?<=[^0-9a-z_\-]))(equal|equals|ne|lt|le|gt|ge|'
            # r'greater|less|than|not|and|or)\s*($|(?=[^0-9a-z_\-]))', Operator.Word),
            (r'\b(?!-)(equal|equals|ne|lt|le|gt|ge|'
             r'greater|less|than|not|and|or)\b(?!-)', Operator.Word),
        ],

        # \"[^\"\n]*\"|\'[^\'\n]*\'
        'strings': [
            # apparently strings can be delimited by EOL if they are continued
            # in the next line
            (r'"[^"\n]*("|\n)', String.Double),
            (r"'[^'\n]*('|\n)", String.Single),
        ],

        'nums': [
            #(r'\d+(\s+|\.$|$)', Number.Integer),
            (r'\b(?!-)\d+\b(?!-)', Number.Integer),
            (r'[+-]?\d*\.\d+([eE][-+]?\d+)?', Number.Float),
            (r'[+-]?\d+\.\d*([eE][-+]?\d+)?', Number.Float),
        ],
    }


class CobolFreeformatLexer(CobolLexer):
    """
    Lexer for Free format OpenCOBOL code.

    *New in Pygments 1.6.*
    """
    name = 'COBOLFree'
    aliases = ['cobolfree']
    filenames = ['*.cbl', '*.CBL']
    mimetypes = []
    flags = re.IGNORECASE | re.MULTILINE

    tokens = {
        'comment': [
            (r'(\*>.*\n|^\w*\*.*$)', Comment),
        ],
    }

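Please excuse the dead code in the comments. I am getting rid of quite a lot of backtracking in the pattern matching; this is still being tested and has not been committed to bitbucket yet. It only produces pretty colours in source listings, with no intelligence and no claim to correctness.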
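For the word-splitting problem in the question, the identifier rule from the lexer's 'root' state can be tried on its own, without Pygments. This is only a sketch: the pattern is copied from the listing above, while the names WORD and cobol_words are hypothetical, and comments, string literals and the fixed-format sequence area are ignored.

```python
import re

# Identifier rule taken from the 'root' state of the lexer above.
WORD = re.compile(r"[a-z0-9]([_a-z0-9\-]*[a-z0-9]+)?", re.IGNORECASE)

def cobol_words(line):
    """Return every run of text that the identifier rule matches."""
    return [match.group(0) for match in WORD.finditer(line)]

print(cobol_words("MOVE ARG-1 TO MY-TABLE(WS-INDEX)."))
```

Collecting matches instead of splitting on separators keeps hyphenated names such as ARG-1 whole and treats parentheses and the period as gaps between words.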

Answer 1: (score: 1)

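Regex is not the right tool for parsing COBOL syntax, but it can be used to split the input text into tokens. Even for this simpler task, Regex alone is not enough. Additional logic is required.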

According to the VS COBOL II grammar Version 1.0.4, identifiers (which it calls "alphabetic user-defined words") are defined as follows:

  

([0-9]+[-]*)*[0-9]*[A-Za-z][A-Za-z0-9]*([-]+[A-Za-z0-9]+)*

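This definition is complicated because it ensures that an identifier contains at least one letter. For splitting, this requirement can be dropped. If you do that, you get this simpler expression for identifiers: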

  

[0-9A-Za-z]+(-[0-9A-Za-z]+)*
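The difference between the two definitions can be checked on a few sample words. The sketch below is in Python rather than C#, and it assumes the patterns read as written in the code (the full form as I read it from the grammar, the simplified form with one or more characters after each hyphen). The helper names are hypothetical.

```python
import re

# Full definition: the word must contain at least one letter.
FULL_WORD = re.compile(
    r"([0-9]+[-]*)*[0-9]*[A-Za-z][A-Za-z0-9]*([-]+[A-Za-z0-9]+)*"
)
# Simplified definition for splitting: the letter requirement is dropped.
SIMPLE_WORD = re.compile(r"[0-9A-Za-z]+(-[0-9A-Za-z]+)*")

def is_full_word(text):
    """True if the whole text matches the full definition."""
    return FULL_WORD.fullmatch(text) is not None

def is_simple_word(text):
    """True if the whole text matches the simplified definition."""
    return SIMPLE_WORD.fullmatch(text) is not None

for candidate in ["ARG-1", "1ST-ITEM", "123", "-ABC"]:
    print(candidate, is_full_word(candidate), is_simple_word(candidate))
```

A purely numeric word such as 123 is the case where the two disagree: the simplified form accepts it, so a later step has to tell numbers and identifiers apart if that matters.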

To keep the delimiters when splitting, simply put the delimiter in a capturing group (between "(" and ")"):

string input = "MY-TABLE(WS-INDEX)";
// The inner group is non-capturing, so only whole identifiers are added to the result.
string[] parts = Regex.Split(input, "([0-9A-Za-z]+(?:-[0-9A-Za-z]+)*)");

The result will be (without the quotes):

  

“”
  “MY-TABLE”
  “(”
  “WS-INDEX”
  “)”
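The same idea can be tried quickly in Python, whose re.split also returns the text of capturing groups along with the pieces between matches. This is a sketch, not the C# from the answer: the inner group is written as non-capturing with one or more characters after each hyphen, and the tokenize helper that drops empty and blank pieces is a hypothetical addition.

```python
import re

# The identifier is the "delimiter"; the outer group keeps it in the result.
TOKEN = re.compile(r"([0-9A-Za-z]+(?:-[0-9A-Za-z]+)*)")

def split_keep(text):
    """Raw split: identifiers plus whatever sits between them."""
    return TOKEN.split(text)

def tokenize(line):
    """Same split, with empty and whitespace-only pieces removed."""
    return [part.strip() for part in TOKEN.split(line) if part.strip()]

print(split_keep("MY-TABLE(WS-INDEX)"))
print(tokenize("MOVE ARG-1 TO MY-TABLE (WS-INDEX)"))
```

Adjacent punctuation such as ")." stays together in one piece, which is one place where the additional logic mentioned above is needed.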


Note

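Many language grammars have recursively defined, nested structures. They also have special rules for things like comments and string escapes, which makes parsing very difficult. Regular expressions can parse such structures (see Regular Expression Recursion and Matching Balanced Constructs), but the expressions become very complicated and hard to understand, because you have to squeeze the entire grammar of the language into one single Regex expression. It is like trying to write a C# application as a single statement. Use a specialized tool instead, such as Irony - .NET Language Implementation Kit or Coco/R.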
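To illustrate why nesting is easier to handle in ordinary code than inside one expression, here is a minimal Python sketch that groups an already split token list by parentheses using a stack. It is not how Irony or Coco/R work, and the function name nest is hypothetical.

```python
def nest(tokens):
    """Turn a flat token list into nested lists, one level per "(...)"."""
    root = []
    stack = [root]  # the innermost open group is always stack[-1]
    for token in tokens:
        if token == "(":
            child = []
            stack[-1].append(child)
            stack.append(child)
        elif token == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            stack.pop()
        else:
            stack[-1].append(token)
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return root

print(nest(["MY-TABLE", "(", "WS-ROW", "(", "WS-INDEX", ")", ")"]))
```

The stack is what a plain regular expression lacks: it remembers how many groups are still open, so any depth of nesting is handled by the same few lines.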