Question

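I'm working with a text file that contains many lines of data. The format I was given is pretty nasty, but it is consistent, which is why I'd like to use RegEx here.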

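Each attribute is separated by whitespace (5 spaces), starting with the state, then the city, then the user type, then the user's address (followed by the number of years they have been at that address), and finally a GUID. I've altered the addresses for security reasons, but every line follows the same format: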

[{     OH     Crestline     Reseller     (1234 Alvarez Dr., 4)     a6fa960c-921a-40e6-a5ab-30cc7fb83907     }]
[{     AZ     Marana     Distributor     (1234 Union St., >1)     1f2a9252-cbac-4e17-8d4c-d5eaebb5f6b7     }]
[{     MI     Lansing     Reseller     (1234 Westmore Ave., 11)     5736c1c0-2e23-43cd-8765-c48fbe51ffee     }]

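What I'm interested in capturing here is the city, and the address together with the number of years. I wrote the following RegEx to do that: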

\[\{[ ]{5}[A-Z]{1,}[ ]{5}([A-Za-z]{1,})[ ]{5}(?:Reseller|Distributor){1,}[ ]{5}\(([0-9]{1,}[ ][A-Za-z]{1,}[ ][A-Za-z.,]{1,}[ ][>0-9]{1,})

Using the expression above with the first line of sample data, the RegEx captures Crestline in group 1 and 1234 Alvarez Dr., 4 in group 2.
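For reference, here is the same expression laid out piece by piece. This is only a sketch in Python's `re` module with the `re.VERBOSE` flag; the question does not name a regex flavor, so Python is an assumption, and `ORIGINAL` and `SAMPLE` are illustrative names.

```python
import re

# The expression from the question, unchanged, split across lines with comments.
# In re.VERBOSE mode whitespace is ignored except inside character classes,
# so the [ ] classes still match literal spaces.
ORIGINAL = re.compile(
    r"""
    \[\{                                 # literal opening bracket and brace
    [ ]{5}[A-Z]{1,}                      # state code
    [ ]{5}([A-Za-z]{1,})                 # group 1: city
    [ ]{5}(?:Reseller|Distributor){1,}   # user type (not captured)
    [ ]{5}\(                             # opening parenthesis of the address
    (                                    # group 2: address and years
        [0-9]{1,}[ ]                     #   house number
        [A-Za-z]{1,}[ ]                  #   street name
        [A-Za-z.,]{1,}[ ]                #   street suffix with trailing comma
        [>0-9]{1,}                       #   years, possibly prefixed with >
    )
    """,
    re.VERBOSE,
)

SAMPLE = "[{     OH     Crestline     Reseller     (1234 Alvarez Dr., 4)     a6fa960c-921a-40e6-a5ab-30cc7fb83907     }]"

match = ORIGINAL.search(SAMPLE)
print(match.groups())
```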

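My question:

Is there a cleaner or more concise way to write this expression so that it still captures these two pieces of information from each line?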

Answer 1

You can write it in a shorter and more efficient way, like this:

\[\{\s{5}[A-Z]+\s{5}(\w+)[^\(]+\(([^,]+),[^0-9]+([0-9]+)\)[^\}]+\}\]

This captures the city name in group 1, the street address in group 2, and the number of years the user has spent at that address in group 3.
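A quick way to check this pattern is to run it over the three sample lines. The sketch below assumes Python's `re` module (the question does not specify a language); `SAMPLE_LINES` and `extract` are hypothetical names. One thing it makes visible: for the `>1` value, `[^0-9]+` consumes the `>`, so group 3 holds only the digits.

```python
import re

# Pattern from this answer, unchanged.
PATTERN = re.compile(
    r"\[\{\s{5}[A-Z]+\s{5}(\w+)[^\(]+\(([^,]+),[^0-9]+([0-9]+)\)[^\}]+\}\]"
)

SAMPLE_LINES = [
    "[{     OH     Crestline     Reseller     (1234 Alvarez Dr., 4)     a6fa960c-921a-40e6-a5ab-30cc7fb83907     }]",
    "[{     AZ     Marana     Distributor     (1234 Union St., >1)     1f2a9252-cbac-4e17-8d4c-d5eaebb5f6b7     }]",
    "[{     MI     Lansing     Reseller     (1234 Westmore Ave., 11)     5736c1c0-2e23-43cd-8765-c48fbe51ffee     }]",
]


def extract(line):
    """Return (city, street, years) as strings, or None if the line does not match."""
    m = PATTERN.search(line)
    return m.groups() if m else None


for line in SAMPLE_LINES:
    print(extract(line))
```

If the `>` matters, it would have to be captured separately or kept inside the last group.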

Answer 2

I would use:

\[\{\s{5}[A-Z]{2}\s{5}(.+?)\s{5}.+?\s{5}\(([^)]+)\)

The city will be in group 1, and the address together with the years in group 2.

Explanation

The regular expression:

(?-imsx:\[\{\s{5}[A-Z]{2}\s{5}(.+?)\s{5}.+?\s{5}\(([^)]+)\))

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  \[                       '['
----------------------------------------------------------------------
  \{                       '{'
----------------------------------------------------------------------
  \s{5}                    whitespace (\n, \r, \t, \f, and " ") (5
                           times)
----------------------------------------------------------------------
  [A-Z]{2}                 any character of: 'A' to 'Z' (2 times)
----------------------------------------------------------------------
  \s{5}                    whitespace (\n, \r, \t, \f, and " ") (5
                           times)
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    .+?                      any character except \n (1 or more times
                             (matching the least amount possible))
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  \s{5}                    whitespace (\n, \r, \t, \f, and " ") (5
                           times)
----------------------------------------------------------------------
  .+?                      any character except \n (1 or more times
                           (matching the least amount possible))
----------------------------------------------------------------------
  \s{5}                    whitespace (\n, \r, \t, \f, and " ") (5
                           times)
----------------------------------------------------------------------
  \(                       '('
----------------------------------------------------------------------
  (                        group and capture to \2:
----------------------------------------------------------------------
    [^)]+                    any character except: ')' (1 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \2
----------------------------------------------------------------------
  \)                       ')'
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
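The pattern explained above can be tried out with a short script. This is a sketch assuming Python's `re` module; `capture` and `LINES` are hypothetical names, and the last entry in `LINES` is a made-up record (not from the question) added only to show that the lazy `(.+?)` also copes with a city name containing a single space.

```python
import re

# Pattern from this answer, unchanged.
PATTERN = re.compile(r"\[\{\s{5}[A-Z]{2}\s{5}(.+?)\s{5}.+?\s{5}\(([^)]+)\)")

LINES = [
    "[{     OH     Crestline     Reseller     (1234 Alvarez Dr., 4)     a6fa960c-921a-40e6-a5ab-30cc7fb83907     }]",
    "[{     AZ     Marana     Distributor     (1234 Union St., >1)     1f2a9252-cbac-4e17-8d4c-d5eaebb5f6b7     }]",
    "[{     MI     Lansing     Reseller     (1234 Westmore Ave., 11)     5736c1c0-2e23-43cd-8765-c48fbe51ffee     }]",
    # Hypothetical record with a two-word city, for illustration only.
    "[{     OH     New Albany     Reseller     (1234 Example Rd., 2)     00000000-0000-0000-0000-000000000000     }]",
]


def capture(line):
    """Return (city, address_with_years), or None if the line does not match."""
    m = PATTERN.search(line)
    return m.groups() if m else None


for line in LINES:
    print(capture(line))
```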

Answer 3

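You said the format is consistent, so you can drop the format validation from the pattern. Judging by the kind of data, you can assume that ( will not appear anywhere before the address. In that case, you can condense it quite a bit: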

[ ]{5}.+?[ ]{5}([^ ]+).+\(([^)]+)

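Breakdown:

[ ]{5}.+?[ ]{5} - skips 2 separate groups of 5 spaces (with a non-greedy match in between, to make sure it is only the first two groups)
([^ ]+) - captures a run of non-space characters (this is the city)
.+\( - skips ahead until it finds (
([^)]+) - captures what is inside the parentheses (this is the address with the years)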
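Because this pattern is not anchored to the start of the line, it needs a search rather than a match from position 0. Here is a minimal sketch, assuming Python's `re` module; `parse_line` is a hypothetical helper, and the extra step of splitting the second group on its last ", " is my own addition, not part of the pattern above.

```python
import re

# Compact pattern from this answer, unchanged.
COMPACT = re.compile(r"[ ]{5}.+?[ ]{5}([^ ]+).+\(([^)]+)")


def parse_line(line):
    """Return (city, address, years), or None if the line does not match."""
    m = COMPACT.search(line)  # search(), since the pattern does not start at "[{"
    if m is None:
        return None
    city, address_and_years = m.groups()
    # Assumes the group always ends in ", <years>", as in the sample data.
    # Splitting on the last ", " keeps a value such as ">1" intact.
    address, years = address_and_years.rsplit(", ", 1)
    return city, address, years


line = "[{     AZ     Marana     Distributor     (1234 Union St., >1)     1f2a9252-cbac-4e17-8d4c-d5eaebb5f6b7     }]"
print(parse_line(line))
```

Note that `([^ ]+)` stops at the first space, so a city name containing a space would be cut short by this version.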
