我目前正在编写一个分析COBOL代码的工具。为此我需要一个正则表达式来分隔单个单词,而我在正则表达式上很糟糕。
我发现以下内容适用于大多数情况,但不是全部。
string[] words = Regex.Split(line, @"[^\p{L}]*\p{Z}[^\p{L}]*");
这个问题是它正在使用像ARG-1这样的字段而只返回ARG。它也没有将像MY-TABLE(WS-INDEX)这样的东西分成MY-TABLE和WS-INDEX。任何帮助我指向正确方向的人都会非常感激。
更新
感谢所有的见解。我完成了我想要的东西:
string[] words = Regex.Split(line, @"\s+");
然后我使用Contains()方法进一步检查单个单词,看看它们中是否有任何表项,例如。
MY-TEST-TABLE(WS-INDEX)
如果他们这样做我将它们子串起来得到2件。
谢谢大家。
答案 0 :(得分:2)
杰夫A;
在http://sourceforge.net/p/open-cobol/code/HEAD/tree/trunk/gnu-cobol/cobc处查看GNU Cobol解析代码,以http://sourceforge.net/p/open-cobol/code/HEAD/tree/trunk/gnu-cobol/cobc/scanner.l开头
该目录,尤其是.l词汇文件包含一些正则表达式,但与.y bison文件结合使用时支持正式语法的上下文。
或者,为了更快速地立即反馈,请尝试将http://koopa.sourceforge.net/船上的Koopa Cobol Parser作为.jar。
或者,对于一个穷人来说,Pygments COBOL语法高亮显示在bitgucket.org下的Pygments / lexers / compiled.py
class CobolLexer(RegexLexer):
"""
Lexer for GNU Cobol code.
*New in Pygments 1.6.*
"""
name = 'COBOL'
aliases = ['cobol']
filenames = ['*.cob', '*.COB', '*.cpy', '*.CPY']
mimetypes = ['text/x-cobol']
flags = re.IGNORECASE | re.MULTILINE
# Data Types: by PICTURE and USAGE
# Operators: **, *, +, -, /, <, >, <=, >=, =, <>
# Logical (?): NOT, AND, OR
# Reserved words:
# http://opencobol.add1tocobol.com/gnucobol/#reserved-words
# Intrinsics:
# http://opencobol.add1tocobol.com/gnucobol/#does-gnu-cobol-implement-any-intrinsic-functions
tokens = {
'root': [
include('comment'),
include('strings'),
include('core'),
include('nums'),
(r'[a-z0-9]([_a-z0-9\-]*[a-z0-9]+)?', Name.Variable),
# (r'[\s]+', Text),
(r'[ \t]+', Text),
],
'comment': [
(r'(^.{6}[*/].*\n|^.{6}|\*>.*\n)', Comment),
],
'core': [
# Figurative constants
#(r'(^|(?<=[^0-9a-z_\-]))(ALL\s+)?'
(r'\b(?!-)(ALL\s+)?'
r'((ZEROES)|(HIGH-VALUE|LOW-VALUE|NULL|QUOTE|SPACE|ZERO)(S)?)'
r'\b(?!-)',
#r'\s*($|(?=[^0-9a-z_\-]))',
Name.Constant),
# Reserved words STATEMENTS and other bolds
#(r'(^|(?<=[^0-9a-z_\-]))'
(r'\b(?!-)'
r'(ACCEPT|ADD|ALLOCATE|CALL|CANCEL|CLOSE|COMPUTE|'
r'CONFIGURATION|CONTINUE|'
r'DATA|DELETE|DISPLAY|DIVIDE|DIVISION|ELSE|END|END-ACCEPT|'
r'END-ADD|END-CALL|END-COMPUTE|END-DELETE|END-DISPLAY|'
r'END-DIVIDE|END-EVALUATE|END-IF|END-MULTIPLY|END-OF-PAGE|'
r'END-PERFORM|END-READ|END-RETURN|END-REWRITE|END-SEARCH|'
r'END-START|END-STRING|END-SUBTRACT|END-UNSTRING|END-WRITE|'
r'ENVIRONMENT|EVALUATE|EXIT|FD|FILE|FILE-CONTROL|FOREVER|'
r'FREE|FUNCTION-ID|GENERATE|GO|GOBACK|'
r'IDENTIFICATION|IF|INITIALIZE|'
r'INITIATE|INPUT-OUTPUT|INSPECT|INVOKE|I-O-CONTROL|LINKAGE|'
r'LOCAL-STORAGE|MERGE|MOVE|MULTIPLY|OPEN|'
r'PERFORM|PROCEDURE|PROGRAM-ID|RAISE|READ|RELEASE|RESUME|'
r'RETURN|REWRITE|SCREEN|'
r'SD|SEARCH|SECTION|SET|SORT|START|STOP|STRING|SUBTRACT|'
r'SUPPRESS|TERMINATE|THEN|UNLOCK|UNSTRING|USE|VALIDATE|'
r'WORKING-STORAGE|WRITE)'
r'\b(?!-)', Keyword.Reserved),
#r'\s*($|(?=[^0-9a-z_\-]))', Keyword.Reserved),
# Reserved words
#(r'(^|(?<=[^0-9a-z_\-]))'
(r'\b(?!-)'
r'(ACCESS|ADDRESS|ADVANCING|AFTER|ALL|'
r'ALPHABET|ALPHABETIC|ALPHABETIC-LOWER|ALPHABETIC-UPPER|'
r'ALPHANUMERIC|ALPHANUMERIC-EDITED|ALSO|ALTER|ALTERNATE|'
r'ANY|ARE|AREA|AREAS|ARGUMENT-NUMBER|ARGUMENT-VALUE|AS|'
r'ASCENDING|ASSIGN|AT|AUTO|AUTO-SKIP|AUTOMATIC|AUTOTERMINATE|'
r'BACKGROUND-COLOR|BASED|BEEP|BEFORE|BELL|'
r'BLANK|'
r'BLINK|BLOCK|BOTTOM|BY|BYTE-LENGTH|CHAINING|'
r'CHARACTER|CHARACTERS|CLASS|CODE|CODE-SET|COL|COLLATING|'
r'COLS|COLUMN|COLUMNS|COMMA|COMMAND-LINE|COMMIT|COMMON|'
r'CONSTANT|CONTAINS|CONTENT|CONTROL|'
r'CONTROLS|CONVERTING|COPY|CORR|CORRESPONDING|COUNT|CRT|'
r'CURRENCY|CURSOR|CYCLE|DATE|DAY|DAY-OF-WEEK|DE|DEBUGGING|'
r'DECIMAL-POINT|DECLARATIVES|DEFAULT|DELIMITED|'
r'DELIMITER|DEPENDING|DESCENDING|DETAIL|DISK|'
r'DOWN|DUPLICATES|DYNAMIC|EBCDIC|'
r'ENTRY|ENVIRONMENT-NAME|ENVIRONMENT-VALUE|EOL|EOP|'
r'EOS|ERASE|ERROR|ESCAPE|EXCEPTION|'
r'EXCLUSIVE|EXTEND|EXTERNAL|'
r'FILE-ID|FILLER|FINAL|FIRST|FIXED|'
r'FOOTING|FOR|FOREGROUND-COLOR|FORMAT|FROM|FULL|FUNCTION|'
r'GIVING|GLOBAL|GROUP|'
r'HEADING|HIGHLIGHT|I-O|ID|'
r'IGNORE|IGNORING|IN|INDEX|INDEXED|INDICATE|'
r'INITIAL|INITIALIZED|INPUT|'
r'INTO|INTRINSIC|INVALID|IS|JUST|JUSTIFIED|'
r'KEY|KEYBOARD|LABEL|'
r'LAST|LEADING|LEFT|LENGTH|LIMIT|LIMITS|LINAGE|'
r'LINAGE-COUNTER|LINE|LINES|LOCALE|LOCK|'
r'LOWLIGHT|MANUAL|MEMORY|MINUS|MODE|'
r'MULTIPLE|NATIONAL|NATIONAL-EDITED|NATIVE|'
r'NEGATIVE|NEXT|NO|NUMBER|NUMBERS|NUMERIC|'
r'NUMERIC-EDITED|OBJECT-COMPUTER|OCCURS|OF|OFF|OMITTED|ON|ONLY|'
r'OPTIONAL|ORDER|ORGANIZATION|OTHER|OUTPUT|OVERFLOW|'
r'OVERLINE|PACKED-DECIMAL|PADDING|PAGE|PARAGRAPH|'
r'PLUS|POSITION|POSITIVE|PRESENT|PREVIOUS|'
r'PRINTER|PRINTING|PROCEDURES|'
r'PROCEED|PROGRAM|PROMPT|QUOTE|'
r'QUOTES|RANDOM|RD|RECORD|RECORDING|RECORDS|RECURSIVE|'
r'REDEFINES|REEL|REFERENCE|RELATIVE|REMAINDER|REMOVAL|'
r'RENAMES|REPLACING|REPORT|REPORTING|REPORTS|REPOSITORY|'
r'REQUIRED|RESERVE|RETURNING|REVERSE-VIDEO|REWIND|'
r'RIGHT|ROLLBACK|ROUNDED|RUN|SAME|SCROLL|'
r'SECURE|SEGMENT-LIMIT|SELECT|SENTENCE|SEPARATE|'
r'SEQUENCE|SEQUENTIAL|SHARING|SIGN|SIGNED|SIGNED-INT|'
r'SIGNED-LONG|SIGNED-SHORT|SIZE|SORT-MERGE|SOURCE|'
r'SOURCE-COMPUTER|SPECIAL-NAMES|STANDARD|'
r'STANDARD-1|STANDARD-2|STATUS|SUM|'
r'SYMBOLIC|SYNC|SYNCHRONIZED|TALLYING|TAPE|'
r'TEST|THROUGH|THRU|TIME|TIMES|TO|TOP|TRAILING|'
r'TRANSFORM|TYPE|UNDERLINE|UNIT|UNSIGNED|'
r'UNSIGNED-INT|UNSIGNED-LONG|UNSIGNED-SHORT|UNTIL|UP|'
r'UPDATE|UPON|USAGE|USING|VALUE|VALUES|VARYING|WAIT|WHEN|'
r'WITH|WORDS|YYYYDDD|YYYYMMDD)'
r'\b(?!-)', Keyword.Pseudo),
#r'\s*($|(?=[^0-9a-z_\-]))', Keyword.Pseudo),
# inactive reserved words
#(r'(^|(?<=[^0-9a-z_\-]))'
(r'\b(?!-)'
r'(ACTIVE-CLASS|ALIGNED|ANYCASE|ARITHMETIC|ATTRIBUTE|B-AND|'
r'B-NOT|B-OR|B-XOR|BIT|BOOLEAN|CD|CENTER|CF|CH|CHAIN|CLASS-ID|'
r'CLASSIFICATION|COMMUNICATION|CONDITION|DATA-POINTER|'
r'DESTINATION|DISABLE|EC|EGI|EMI|ENABLE|END-RECEIVE|'
r'ENTRY-CONVENTION|EO|ESI|EXCEPTION-OBJECT|EXPANDS|FACTORY|'
r'FLOAT-BINARY-16|FLOAT-BINARY-34|FLOAT-BINARY-7|'
r'FORMAT|'
r'GET|GROUP-USAGE|IMPLEMENTS|INFINITY|'
r'INHERITS|INTERFACE|INTERFACE-ID|INVOKE|LC_ALL|LC_COLLATE|'
r'LC_CTYPE|LC_MESSAGES|LC_MONETARY|LC_NUMERIC|LC_TIME|'
r'LINE-COUNTER|MESSAGE|METHOD|METHOD-ID|NESTED|NONE|NORMAL|'
r'OBJECT|OBJECT-REFERENCE|OPTIONS|OVERRIDE|PAGE-COUNTER|PF|PH|'
r'PROPERTY|PROTOTYPE|PURGE|QUEUE|RAISE|RAISING|RECEIVE|'
r'RELATION|REPLACE|REPRESENTS-NOT-A-NUMBER|RESET|RESUME|RETRY|'
r'RF|RH|SECONDS|SEGMENT|SELF|SEND|SOURCES|STATEMENT|STEP|'
r'STRONG|SUB-QUEUE-1|SUB-QUEUE-2|SUB-QUEUE-3|SUPER|SYMBOL|'
r'SYSTEM-DEFAULT|TABLE|TERMINAL|TEXT|TYPEDEF|UCS-4|UNIVERSAL|'
r'USER-DEFAULT|UTF-16|UTF-8|VAL-STATUS|VALID|VALIDATE|'
r'VALIDATE-STATUS)\b(?!-)', Comment),
#r'VALIDATE-STATUS)\s*($|(?=[^0-9a-z_\-]))', Comment),
# Data Types
(r'(^|(?<=[^0-9a-z_\-]))'
#(r'\b(?!-)'
r'(PIC\s+.+?(?=(\s|\.\s))|PICTURE\s+.+?(?=(\s|\.\s))|'
r'(COMPUTATIONAL)(-[1-5X])?|(COMP)(-[1-5X])?|'
r'BINARY-C-LONG|POINTER|PROGRAM-POINTER|'
r'FUNCTION-POINTER|PROCEDURE-POINTER|'
r'BINARY-CHAR|BINARY-DOUBLE|BINARY-LONG|BINARY-SHORT|'
r'FLOAT-SHORT|FLOAT-LONG|FLOAT-DECIMAL-16|FLOAT-DECIMAL-34|'
r'FLOAT-BINARY-32|FLOAT-BINARY-64|FLOAT-BINARY-128|'
r'FLOAT-EXTENDED|FLOAT-DECIMAL-7|'
# r'BINARY)\b(?!-)', Keyword.Type),
r'BINARY)\s*($|(?=[^0-9a-z_\-]))', Keyword.Type),
# Operators
(r'(\*\*|\*|\+|-|/|<=|>=|<|>|==|/=|=)', Operator),
# (r'(::)', Keyword.Declaration),
(r'([(),;:&%.])', Punctuation),
# Intrinsics
#(r'(^|(?<=[^0-9a-z_\-]))(ABS|ACOS|ANNUITY|ASIN|ATAN|BYTE-LENGTH|'
(r'\b(?!-)(ABS|ACOS|ANNUITY|ASIN|ATAN|BYTE-LENGTH|'
r'CHAR|COMBINED-DATETIME|CONCATENATE|COS|CURRENT-DATE|'
r'DATE-OF-INTEGER|DATE-TO-YYYYMMDD|DAY-OF-INTEGER|DAY-TO-YYYYDDD|'
r'EXCEPTION-(?:FILE|LOCATION|STATEMENT|STATUS)|EXP10|EXP|E|'
r'FACTORIAL|FRACTION-PART|INTEGER-OF-(?:DATE|DAY|PART)|INTEGER|'
r'LENGTH|LOCALE-(?:DATE|TIME(?:-FROM-SECONDS)?)|LOG10|LOG|'
r'LOWER-CASE|MAX|MEAN|MEDIAN|MIDRANGE|MIN|MOD|NUMVAL(?:-C)?|'
r'ORD(?:-MAX|-MIN)?|PI|PRESENT-VALUE|RANDOM|RANGE|REM|REVERSE|'
r'SECONDS-FROM-FORMATTED-TIME|SECONDS-PAST-MIDNIGHT|SIGN|SIN|SQRT|'
r'STANDARD-DEVIATION|STORED-CHAR-LENGTH|SUBSTITUTE(?:-CASE)?|'
r'SUM|TAN|TEST-DATE-YYYYMMDD|TEST-DAY-YYYYDDD|TRIM|'
r'UPPER-CASE|VARIANCE|WHEN-COMPILED|YEAR-TO-YYYY)'
r'\b(?!-)', Name.Function),
#r'UPPER-CASE|VARIANCE|WHEN-COMPILED|YEAR-TO-YYYY)\s*'
#r'($|(?=[^0-9a-z_\-]))', Name.Function),
# Booleans
#(r'(^|(?<=[^0-9a-z_\-]))(true|false)\s*($|(?=[^0-9a-z_\-]))', Name.Builtin),
(r'\b(?!-)(true|false)\b(?!-)', Name.Builtin),
# Comparing Operators
#(r'(^|(?<=[^0-9a-z_\-]))(equal|equals|ne|lt|le|gt|ge|'
# r'greater|less|than|not|and|or)\s*($|(?=[^0-9a-z_\-]))', Operator.Word),
(r'\b(?!-)(equal|equals|ne|lt|le|gt|ge|'
r'greater|less|than|not|and|or)\b(?!-)', Operator.Word),
],
# \"[^\"\n]*\"|\'[^\'\n]*\'
'strings': [
# apparently strings can be delimited by EOL if they are continued
# in the next line
(r'"[^"\n]*("|\n)', String.Double),
(r"'[^'\n]*('|\n)", String.Single),
],
'nums': [
#(r'\d+(\s+|\.$|$)', Number.Integer),
(r'\b(?!-)\d+\b(?!-)', Number.Integer),
(r'[+-]?\d*\.\d+([eE][-+]?\d+)?', Number.Float),
(r'[+-]?\d+\.\d*([eE][-+]?\d+)?', Number.Float),
],
}
class CobolFreeformatLexer(CobolLexer):
"""
Lexer for Free format OpenCOBOL code.
*New in Pygments 1.6.*
"""
name = 'COBOLFree'
aliases = ['cobolfree']
filenames = ['*.cbl', '*.CBL']
mimetypes = []
flags = re.IGNORECASE | re.MULTILINE
tokens = {
'comment': [
(r'(\*>.*\n|^\w*\*.*$)', Comment),
],
}
请原谅死代码评论,摆脱相当多的回溯模式匹配,仍处于测试阶段,尚未致力于bitbucket。这只是源列表中的漂亮颜色,没有智能或正确性。
答案 1 :(得分:1)
正则表达式不是分析COBOL语法的正确工具;但是在将输入文本拆分为标记时可以使用它。但即使是这个更简单的任务,单靠Regex还不够。需要额外的逻辑。
根据VS COBOL II grammar Version 1.0.4标识符(他们称之为“字母用户定义的字词”)定义如下:
([0-9] + [ - ] ) [0-9]的 [A-ZA-Z] [A-ZA-Z0-9] ( [ - ] + [A-ZA-Z0-9] +)*
这个定义很复杂,因为它确保标识符至少包含一个字母。对于拆分,可以删除此要求。如果你这样做,你会得到这个标识符的简单表达式:
[0-9A-ZA-Z] +( - [0-9A-ZA-Z])*
为了在拆分时保留分隔符,只需将分隔符放入捕获组(“(”和“)”之间):
string input = "MY-TABLE(WS-INDEX)";
string[] parts = Regex.Split(input, "([0-9A-Za-z]+(-[0-9A-Za-z])*)");
结果将是(没有引号):
“”
“MY-TABLE”
“(”
“WS-INDEX”
“)”
注意强>
许多语言语法具有递归定义的嵌套结构。此外,它们具有注释和字符串转义等特殊规则,这使得解析非常困难。正则表达式可以解析这样的结构(参见Regular Expression Recursion and Matching Balanced Constructs)但是Regex表达式变得非常复杂并且很难理解,因为你必须将要解析的语言的整个语法压缩成一个单一的Regex表达式。就像你试图将C#应用程序编写为单个语句一样。使用工具专用工具,例如Irony - .NET Language Implementation Kit或Coco/R。