Question

I have a text whose length is between 1 and 1000 characters. I want to extract the following substrings from it.

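1. Substrings of the form ABCxx / ABCx, where ABC is always three English letters and x / xx is a number from 0 to 99 (one or two digits). The following regex extracts these for me: [a-zA-Z]{3}[0-9]{1,2}
2. Substrings of the form <space>ABC<space>, <space>ABC (the last word in the text) and ABC<space> (the first word in the text). Basically, I am trying to find three-letter words delimited by spaces. To get these matches I have the following regexes:

[ ][a-zA-Z]{3}[ ], [ ][a-zA-Z]{3} and [a-zA-Z]{3}[ ]

3. Same as 2, but the three-letter string may also be enclosed in square brackets, as in [ABC].

\[([a-zA-Z]{3})\]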

Since the patterns are more or less alike, is there a way to combine these 5 patterns into one?

For example: ABC catmat dogdog [rat] LAN45 eat HGF1 jkhgkj abc

The valid matches here are ABC, rat, LAN45, eat, HGF1 and abc.

Answer 1

R = /
    \p{L}{3}\d{1,2}    # match 3 letters followed by 1 or 2 digits
    |                  # or
    (?<=\A|\p{Space})  # match start of string or a space in a pos lookbehind
    (?:                # begin a non-capture group
      \p{L}{3}         # match three letters
      |                # or
      \[\p{L}{3}\]     # match three letters surrounded by brackets
    )                  # end of non-capture group
    (?=\p{Space}|\z)   # match space or end of string in a pos lookahead
    /x                 # free-spacing regex definition mode

"ABC catmat dogdog [rat] LAN45 eat HGF1 jkhgkj abc".scan R
   #=> ["ABC", "[rat]", "LAN45", "eat", "HGF1", "abc"]
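The question lists rat rather than [rat] among the valid matches. Below is a minimal sketch, assuming the surrounding brackets should simply be dropped from the results. `PATTERN` repeats the regex above so the snippet runs on its own, and `strip_brackets` is a hypothetical helper name, not part of the original answer.

```ruby
# Same pattern as R above, repeated here so the snippet is self-contained.
PATTERN = /
  \p{L}{3}\d{1,2}
  |
  (?<=\A|\p{Space})
  (?:\p{L}{3}|\[\p{L}{3}\])
  (?=\p{Space}|\z)
/x

# Hypothetical helper: scan the text, then remove any square brackets
# from each match ("[rat]" becomes "rat").
def strip_brackets(text)
  text.scan(PATTERN).map { |m| m.delete("[]") }
end

p strip_brackets("ABC catmat dogdog [rat] LAN45 eat HGF1 jkhgkj abc")
  #=> ["ABC", "rat", "LAN45", "eat", "HGF1", "abc"]
```

String#delete treats its argument as a set of characters, so "[]" removes both bracket characters wherever they occur in a match.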

Written conventionally (not in free-spacing mode), this regex is:

R = /\p{L}{3}\d{1,2}|(?<=\A| )(?:\p{L}{3}|\[\p{L}{3}\])(?= |\z)/

Now consider:

 "ABCD123 [efg]456".scan R
   #=> ["BCD12"]

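I think this is consistent with the statement of the question. If, however, "BCD12" should not be matched (because it is preceded by a letter or followed by a digit; both apply here), the regex should be modified as follows.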

R = /
    (?<=\A|\p{Space})  # match start of string or a space in a pos lookbehind
    (?:                # begin a non-capture group
      \p{L}{3}         # match three letters
      \d{,2}           # match 0, 1 or 2 digits      
      |                # or
      \[\p{L}{3}\]     # match three letters surrounded by brackets
    )                  # end of non-capture group
    (?=\p{Space}|\z)   # match space or end of string in a pos lookahead
    /x                 # free-spacing regex definition mode

"ABC catmat dogdog [rat] XLAN45 eat HGF123 jkhgkj abc".scan R
  #=> ["ABC", "[rat]", "eat", "abc"]

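Note that in both free-spacing regexes I wrote \p{Space} where the conventional form has a literal space character. In free-spacing mode, whitespace is removed before the regex is parsed, so a space must be written as [ ] (a character class containing a space) or as an escaped space character (\ ). Alternatively, \p{Space}, [[:space:]] or, if appropriate, \s can be used; these match any whitespace character (spaces, newlines, tabs and a few others), not only a space.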
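The effect of free-spacing mode on literal spaces can be checked with a small sketch. The constant names and the toy patterns below are illustrative only and do not come from the original answer.

```ruby
# In free-spacing mode (/x) an unescaped literal space in the pattern is ignored,
# so this pattern is equivalent to /ab/.
IGNORED_SPACE = /a b/x

# A space inside a character class, or an escaped space, is kept.
CLASS_SPACE   = /a[ ]b/x
ESCAPED_SPACE = /a\ b/x

p IGNORED_SPACE.match?("ab")   #=> true
p IGNORED_SPACE.match?("a b")  #=> false
p CLASS_SPACE.match?("a b")    #=> true
p ESCAPED_SPACE.match?("a b")  #=> true
```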

Answer 2

Thanks, everyone, for the answers. This regex did the job for me.

(\b[a-zA-Z]{3}([0-9]{1,2})?\b)

It is a regex for formats such as [ABC], ABC and ABCxx, where xx is a number.
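Here is a minimal sketch of how this pattern behaves on the sample text from the question, assuming Ruby as in Answer 1. The groups are written as non-capturing because String#scan returns the captured groups, rather than the whole match, when the pattern contains capture groups. `WORD_PATTERN` and `three_letter_tokens` are hypothetical names used only for illustration.

```ruby
# Answer 2's pattern with the capture groups made non-capturing,
# so that scan returns whole matches.
WORD_PATTERN = /\b[a-zA-Z]{3}(?:[0-9]{1,2})?\b/

# Hypothetical helper: return every three-letter token, optionally
# followed by one or two digits, that sits between word boundaries.
def three_letter_tokens(text)
  text.scan(WORD_PATTERN)
end

p three_letter_tokens("ABC catmat dogdog [rat] LAN45 eat HGF1 jkhgkj abc")
  #=> ["ABC", "rat", "LAN45", "eat", "HGF1", "abc"]
```

Unlike the lookarounds in Answer 1, \b treats any non-word character as a boundary, not only whitespace, so a token such as foo-bar yields both foo and bar.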
