Perl regex to group items that may or may not be enclosed in quotes, ignoring whitespace

Time: 2016-01-30 07:17:45

Tags: regex perl

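I am extracting some database information into a temporary log. I need to write a regex to parse it so that it can be fed into an analytics program. I need to group each "field" as follows: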

  • YYYY-MM-DD HH:MM:SS
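  • Facility
  • Severity
  • Server
  • YYYY-MM-DD:HH:MM:SS
  • TimeZone
  • IPAddress
  • LegacyEmailAddress
  • FirstName (**may or may not contain multiple words enclosed in quotes)
  • LastName (**may or may not contain multiple words enclosed in quotes)
  • ACCTNUM
  • ProgramCode
  • UID
  • EmailAddress
  • EventType
  • Source
  • Category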

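I have almost all of the regex worked out, but I am having trouble grouping some fields, specifically FirstName and LastName. Ideally I would like to capture these as two separate fields (stripping the quotes if they are present), but combining FirstName and LastName into one field is also fine.

The problem with my current regex is that, although it groups FirstName and LastName into a single field (not ideal, but acceptable), there seems to be an extra field that captures whitespace.

Here is the regex I have so far: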

^(\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})\s+(\S+)\.(\S+)\s+(\S+)\s+(\d{4}-\d{2}-\d{2}:\s\d{2}:\d{2}:\d{2})\s+(.*?)\s+(.*?)\s+(.*?)\s+(?<!")(.*)(?!")\s+(.*)\s+(.*)\s+(.*)\s+(.*)\s+(.*)\s+(.*)\s+(.*)\s+(.*)$

Here are some sample events:

2016-01-29 18:19:54 local1.info server.domain.com 2016-01-29: 11:19:54 MST UNKNOWN UNKNOWN FOO "BAR BAZ" UNKNOWN UNKNOWN UNKNOWN EMAIL@EXAMPLE.COM PROFILE_CHANGE ProfileChangeProcessor A
2016-01-29 18:20:25 local4.info server.domain.com 2016-01-29: 11:20:25 MST UNKNOWN UNKNOWN "F B" BAZ ABC12345 GP SOME_UID EMAIL@EXAMPLE.COM EVENT_FROM_SOME_PROCESS UNKNOWN UNKNOWN
2016-01-29 18:23:10 local1.info server.domain.com 2016-01-29: 11:23:10 MST UNKNOWN UNKNOWN FOO BAR UNKNOWN UNKNOWN UNKNOWN EMAIL@EXAMPLE.COM SOME_CHANGE ProfileChangeProcessor AP
2016-01-29 18:26:24 local1.info server.domain.com 2016-01-29: 11:26:24 MST UNKNOWN EMAIL@EXAMPLE.COM FOO "B'Baz" UNKNOWN UNKNOWN UNKNOWN  SOME_CHANGE ProfileChangeProcessor O
2016-01-29 18:26:55 local1.info server.domain.com 2016-01-29: 11:26:55 MST UNKNOWN EMAIL@EXAMPLE.COM "FOO OR BAR" BAZ SXR12646 GP UNKNOWN  SOME_CHANGE ProfileChangeProcessor M

This is the output when I run it through a Perl one-liner:

$ cat foo.txt | perl -ne '/^(\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})\s+(\S+)\.(\S+)\s+(\S+)\s+(\d{4}-\d{2}-\d{2}:\s\d{2}:\d{2}:\d{2})\s+(.*?)\s+(.*?)\s+(.*?)\s+(?<!")(.*)(?!")\s+(.*)\s+(.*)\s+(.*)\s+(.*)\s+(.*)\s+(.*)\s+(.*)\s+(.*)$/ && print "$1|$2|$3|$4|$5|$6|$7|$8|$9|$10|$11|$12|$13|$14|$15|$16|\n"' 

2016-01-29 18:19:54|local1|info|server.domain.com|2016-01-29: 11:19:54|MST|UNKNOWN|UNKNOWN|FOO "BAR BAZ"|UNKNOWN|UNKNOWN|UNKNOWN|EMAIL@EXAMPLE.COM|PROFILE_CHANGE|ProfileChangeProcessor|A|
2016-01-29 18:20:25|local4|info|server.domain.com|2016-01-29: 11:20:25|MST|UNKNOWN|UNKNOWN|"F B" BAZ|ABC12345|GP|SOME_UID|EMAIL@EXAMPLE.COM|EVENT_FROM_SOME_PROCESS|UNKNOWN|UNKNOWN|
2016-01-29 18:23:10|local1|info|server.domain.com|2016-01-29: 11:23:10|MST|UNKNOWN|UNKNOWN|FOO BAR|UNKNOWN|UNKNOWN|UNKNOWN|EMAIL@EXAMPLE.COM|SOME_CHANGE|ProfileChangeProcessor|AP|
2016-01-29 18:26:24|local1|info|server.domain.com|2016-01-29: 11:26:24|MST|UNKNOWN|EMAIL@EXAMPLE.COM|FOO "B'Baz"|UNKNOWN|UNKNOWN|UNKNOWN||SOME_CHANGE|ProfileChangeProcessor|O|
2016-01-29 18:26:55|local1|info|server.domain.com|2016-01-29: 11:26:55|MST|UNKNOWN|EMAIL@EXAMPLE.COM|"FOO OR BAR" BAZ|SXR12646|GP|UNKNOWN||SOME_CHANGE|ProfileChangeProcessor|M|

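The current problem with the regex above is in the last two records: group #13 is an empty field, and I don't know how to account for that. If I can't get the output into consistent fields, I can't load it properly into the analytics engine. Overall, I would like to know whether there is a better way to group the fields as outlined above, and how to make sure that no group ends up capturing only whitespace (or similar characters).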

2 answers:

Answer 0 (score: 0)

Here is what I would do:

^\s*
# date
(?<date>\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})
# facility.severity
\s(?<facility>\S+)\.(?<severity>\S+)
# server
\s(?<server>\S*)
# date
\s(?<otherDate>\d{4}-\d{2}-\d{2}:\s\d{2}:\d{2}:\d{2})
# time zone
\s(?<timeZone>\S*)
# ip address
\s(?<ip>\S*)
# legacy email address
\s(?<legacyEmailAddress>\S*)
# first name
\s(?|"(?<firstName>[^"\n]+)"|(?<firstName>\S*))
# last name
\s(?|"(?<lastName>[^"\n]+)"|(?<lastName>\S*))
# account number
\s(?<account>\S*)
# program code
\s(?<programCode>\S*)
# uid
\s(?<uid>\S*)
# email address
\s(?<emailAddress>\S*)
# event type
\s(?<eventType>\S*)
# source
\s(?<source>\S*)
# category
\s(?<category>\S*)
\s*$

Demo with your sample data

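  • First, when you have a pattern like this, use the x modifier so that you can add whitespace and comments inside the expression.
  • Next, what is $13 even supposed to mean? Name your capture groups and the code becomes much easier to read.
  • Since you can have empty fields, I assumed there is exactly one whitespace separator between fields. There is no real way around that: with empty fields allowed, a run of several spaces would be ambiguous.
  • Adding rules more specific than \S* wouldn't hurt, but that is up to you.
  • As for the names, the pattern is: (?|"(?<name>[^"\n]+)"|(?<name>\S*))
    • (?| ... ) is a branch reset group. It lets you reuse the same capture group number or name in each alternative.
    • "(?<name>[^"\n]+)" captures a quoted name, without the quotes.
    • (?<name>\S*) captures an unquoted name. Only one of the two alternatives can match, and either way the result lands in the same capture group.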
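To show how the named groups are read back, here is a minimal sketch that wraps the pattern above in a function and prints the fields by name. The names parse_line, @order and $sample are hypothetical helpers of mine, not part of the pattern itself, and the sketch assumes Perl 5.10 or later (needed for named captures and branch reset groups).

```perl
use 5.010;
use strict;
use warnings;

# Hypothetical helper: parse one log line into a hash of named fields.
# Returns nothing (undef in scalar context) if the line does not match.
sub parse_line {
    my ($line) = @_;
    return unless $line =~ m{
        ^\s*
        (?<date>\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})
        \s(?<facility>\S+)\.(?<severity>\S+)
        \s(?<server>\S*)
        \s(?<otherDate>\d{4}-\d{2}-\d{2}:\s\d{2}:\d{2}:\d{2})
        \s(?<timeZone>\S*)
        \s(?<ip>\S*)
        \s(?<legacyEmailAddress>\S*)
        # quoted or bare name, captured without the quotes
        \s(?|"(?<firstName>[^"\n]+)"|(?<firstName>\S*))
        \s(?|"(?<lastName>[^"\n]+)"|(?<lastName>\S*))
        \s(?<account>\S*)
        \s(?<programCode>\S*)
        \s(?<uid>\S*)
        \s(?<emailAddress>\S*)
        \s(?<eventType>\S*)
        \s(?<source>\S*)
        \s(?<category>\S*)
        \s*$
    }x;
    my %fields = %+;    # copy the named captures before they are overwritten
    return \%fields;
}

# Output order for the fields (my own choice, following the question's list).
my @order = qw(date facility severity server otherDate timeZone ip
               legacyEmailAddress firstName lastName account programCode
               uid emailAddress eventType source category);

# Last sample line from the question; its EmailAddress field is empty.
my $sample = '2016-01-29 18:26:55 local1.info server.domain.com 2016-01-29: 11:26:55 MST UNKNOWN EMAIL@EXAMPLE.COM "FOO OR BAR" BAZ SXR12646 GP UNKNOWN  SOME_CHANGE ProfileChangeProcessor M';

if ( my $rec = parse_line($sample) ) {
    print join( '|', map { $rec->{$_} // '' } @order ), "\n";
}
```

Because every field is addressed by name, an empty emailAddress simply shows up as an empty string in its own slot instead of shifting the fields after it.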

Answer 1 (score: 0)

This can be done more simply.

use strict;
use warnings;

while( my $line = <DATA> ) {
    # The pattern matches either a run of text enclosed in
    # double quotes (") or a run of non-whitespace characters.
    # In list context, the /g modifier returns every such
    # match, so each one becomes a field.
    my @fields = ( $line =~ /"[^"]*"|\S+/g );
    print join('|', @fields), "\n";
}

__DATA__
2016-01-29 18:19:54 local1.info server.domain.com 2016-01-29: 11:19:54 MST UNKNOWN UNKNOWN FOO "BAR BAZ" UNKNOWN UNKNOWN UNKNOWN EMAIL@EXAMPLE.COM PROFILE_CHANGE ProfileChangeProcessor A
2016-01-29 18:20:25 local4.info server.domain.com 2016-01-29: 11:20:25 MST UNKNOWN UNKNOWN "F B" BAZ ABC12345 GP SOME_UID EMAIL@EXAMPLE.COM EVENT_FROM_SOME_PROCESS UNKNOWN UNKNOWN
2016-01-29 18:23:10 local1.info server.domain.com 2016-01-29: 11:23:10 MST UNKNOWN UNKNOWN FOO BAR UNKNOWN UNKNOWN UNKNOWN EMAIL@EXAMPLE.COM SOME_CHANGE ProfileChangeProcessor AP
2016-01-29 18:26:24 local1.info server.domain.com 2016-01-29: 11:26:24 MST UNKNOWN EMAIL@EXAMPLE.COM FOO "B'Baz" UNKNOWN UNKNOWN UNKNOWN  SOME_CHANGE ProfileChangeProcessor O
2016-01-29 18:26:55 local1.info server.domain.com 2016-01-29: 11:26:55 MST UNKNOWN EMAIL@EXAMPLE.COM "FOO OR BAR" BAZ SXR12646 GP UNKNOWN  SOME_CHANGE ProfileChangeProcessor M

This produces:

2016-01-29|18:19:54|local1.info|server.domain.com|2016-01-29:|11:19:54|MST|UNKNOWN|UNKNOWN|FOO|"BAR BAZ"|UNKNOWN|UNKNOWN|UNKNOWN|EMAIL@EXAMPLE.COM|PROFILE_CHANGE|ProfileChangeProcessor|A
2016-01-29|18:20:25|local4.info|server.domain.com|2016-01-29:|11:20:25|MST|UNKNOWN|UNKNOWN|"F B"|BAZ|ABC12345|GP|SOME_UID|EMAIL@EXAMPLE.COM|EVENT_FROM_SOME_PROCESS|UNKNOWN|UNKNOWN
2016-01-29|18:23:10|local1.info|server.domain.com|2016-01-29:|11:23:10|MST|UNKNOWN|UNKNOWN|FOO|BAR|UNKNOWN|UNKNOWN|UNKNOWN|EMAIL@EXAMPLE.COM|SOME_CHANGE|ProfileChangeProcessor|AP
2016-01-29|18:26:24|local1.info|server.domain.com|2016-01-29:|11:26:24|MST|UNKNOWN|EMAIL@EXAMPLE.COM|FOO|"B'Baz"|UNKNOWN|UNKNOWN|UNKNOWN|SOME_CHANGE|ProfileChangeProcessor|O
2016-01-29|18:26:55|local1.info|server.domain.com|2016-01-29:|11:26:55|MST|UNKNOWN|EMAIL@EXAMPLE.COM|"FOO OR BAR"|BAZ|SXR12646|GP|UNKNOWN|SOME_CHANGE|ProfileChangeProcessor|M

You may want to strip the leading and trailing double quotes and whitespace:

foreach my $field ( @fields ) {
    $field =~ s/^\s*"//;
    $field =~ s/"\s*$//;
}
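One thing to watch with this approach: \S+ skips over runs of whitespace, so an empty field simply disappears. In the output above, the last two rows have 17 tokens instead of 18. A minimal sketch that combines the tokenizing and quote stripping and flags short rows could look like the following; tokenize, $EXPECTED and $short_row are hypothetical names, and the expected count of 18 is an assumption taken from the complete sample rows.

```perl
use strict;
use warnings;

# Assumption: a complete line yields 18 tokens with this tokenizer
# (each timestamp splits into two tokens; facility.severity stays joined).
my $EXPECTED = 18;

# Hypothetical helper: split a line into tokens and strip surrounding quotes.
sub tokenize {
    my ($line) = @_;
    my @tokens = $line =~ /"[^"]*"|\S+/g;
    s/^"|"$//g for @tokens;    # drop one leading and one trailing double quote
    return @tokens;
}

# Fourth sample line from the question; its EmailAddress field is empty.
my $short_row = q{2016-01-29 18:26:24 local1.info server.domain.com 2016-01-29: 11:26:24 MST UNKNOWN EMAIL@EXAMPLE.COM FOO "B'Baz" UNKNOWN UNKNOWN UNKNOWN  SOME_CHANGE ProfileChangeProcessor O};

my @row = tokenize($short_row);
if ( @row != $EXPECTED ) {
    printf "expected %d fields, got %d; a field is probably empty\n",
        $EXPECTED, scalar @row;
}
print join( '|', @row ), "\n";
```

If the position of the empty field matters to the analytics engine, a count check like this only detects the problem; recovering which field was empty needs a position-aware pattern.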