Need help iterating over a file with a specific format

Time: 2014-01-17 01:45:36

Tags: perl parsing csv

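I'm trying to parse a txt file that has a specific format and convert it to a CSV file, but I've run into two problems:

  1. I need to skip the header that separates each entry (4 lines, the first of which starts with \n).
  2. Only the last entry is read. I'm not sure what I'm doing wrong, or what to change so that it reads every entry in the text file.

My code: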

    my $grammar = qr!
            (?(DEFINE)
               (?<Identifier> [^=\n]+ )
               (?<Statement>
                   (?: # Begin alternation
                       " #Opening quotes
                       [^"]+? # Any non-quotes (including a new line)
                       " # Closing quotes
                      | [^\n]+ # Or a single line
                   )   # End alternation
                )
    
           )
    
        !x;
    
        my $file = do { local $/; <> }; #Slurp file named on command line
        my %columns;
        while( $file =~
           m{ ((?&Identifier))[\t ]*=[ \t]*((?&Statement)) $grammar}xgc )
        {
           my ($header,$value) = ($1,$2);
    
               # Remove leading spaces and quote variable if it contains commas:
           for($header,$value) { s/^\s+//mg; /,/ and s/^|$/"/g }
    
               # Substitute \n with \\n to make multi-line values single-line:
           for($value) { chomp; s/\n/\\n/g }
    
           $columns{$header}=$value
        }
    
        print join "," => sort keys %columns; # Print column headers
        print "\n";
        print join "," => map { $columns{$_} } sort keys %columns; # Column content
        print "\n";
    

The input file looks like this:

    OPERATION_CONTEXT server:.oc_name alarm_object 1
    On director: server:.temip.prd1149_director
    AT Thu, Jan 16, 2014 10:33:44 PM All Attributes
    
                                 Identifier = 1
                                      State = Outstanding
                             Problem Status = Not-Handled
                      Clearance Report Flag = False
                            Escalated Alarm = False
                         Creation Timestamp = Thu, Jan 16, 2014 10:21:17 PM
                             Managed Object = NETACT server:.NETACT51 BSC 716499 BCF 123
                            Target Entities = { NETACT server:.NETACT51 BSC 716499 BCF 123 }
                                 Alarm Type = EnvironmentalAlarm
                                 Event Time = Thu, Jan 16, 2014 10:17:14 PM
                             Probable Cause = Indeterminate
                          Specific Problems = { 7409 }
                    Notification Identifier = 2433009629
                                     Domain = Domain server:.netact51_dom
                               Alarm Origin = IncomingAlarm
                         Perceived Severity = Critical
                            Additional Text = "ALARMA CRITICA SISTEMA DAS 1900
                                              #S#10497409      ***                                       ZONA TECNICA SANTI
                                              PLMN-PLMN/BSC-716499/BCF-123
    
                                              SC_logical_name:9344;"
                          Original Severity = Critical
                        Original Event Time = Thu, Jan 16, 2014 10:17:14 PM
                                Outage Flag = False
                        Problem Occurrences = 1 Problems
                   GPP3 Problem Occurrences = 0 Problems
               Critical Problem Occurrences = 1 Problems
                  Major Problem Occurrences = 0 Problems
                  Minor Problem Occurrences = 0 Problems
                Warning Problem Occurrences = 0 Problems
          Indeterminate Problem Occurrences = 0 Problems
                  Clear Problem Occurrences = 0 Problems
                                   SA Total = 0 Alarms
                                     Comuna = "HUECHURABA"
                                 CatCliente = "CAV"
                                   Nemonico = "BSMT6_PZANF3"
    
    OPERATION_CONTEXT server:.oc_name alarm_object 2
    On director: server:.temip.prd1149_director
    AT Thu, Jan 16, 2014 10:33:44 PM All Attributes
    
                                 Identifier = 2
                                      State = Outstanding
                             Problem Status = Not-Handled
                      Clearance Report Flag = False
                            Escalated Alarm = False
                         Creation Timestamp = Thu, Jan 16, 2014 10:14:03 PM
                       Clearance Time Stamp = Thu, Jan 16, 2014 10:29:08 PM
                             Managed Object = NETACT server:.NETACT51 BSC 206259 BCF 103
                            Target Entities = { NETACT server:.NETACT51 BSC 206259 BCF 103 }
                                 Alarm Type = EnvironmentalAlarm
                                 Event Time = Thu, Jan 16, 2014 10:29:37 PM
                             Probable Cause = Indeterminate
                          Specific Problems = { 7409 }
                    Notification Identifier = 3780327614
                                     Domain = Domain server:.netact51_dom
                               Alarm Origin = IncomingAlarm
                         Perceived Severity = Critical
                            Additional Text = "ALARMA CRITICA SISTEMA DAS 1900
                                              #S#10497409      ***                                       ZONA TECNICA CENTR
                                              Merval                           BSC VLP7
                                              PLMN-PLMN/BSC-206259/BCF-103
                                              ALARMA CRITICA SISTEMA DAS 1900
    
                                              SC_logical_name:94681;"
                          Original Severity = Critical
                        Original Event Time = Thu, Jan 16, 2014 10:10:01 PM
                                Outage Flag = False
                        Problem Occurrences = 4 Problems
                   GPP3 Problem Occurrences = 0 Problems
               Critical Problem Occurrences = 4 Problems
                  Major Problem Occurrences = 0 Problems
                  Minor Problem Occurrences = 0 Problems
                Warning Problem Occurrences = 0 Problems
          Indeterminate Problem Occurrences = 0 Problems
                  Clear Problem Occurrences = 3 Problems
                                   SA Total = 6 Alarms
                                     Comuna = "VINA DEL MAR"
                                 CatCliente = "CAV"
                                   Nemonico = "BVLP7_MVALF9"
    
    OPERATION_CONTEXT server:.oc_name alarm_object 3
    On director: server:.temip.prd1149_director
    AT Thu, Jan 16, 2014 10:33:45 PM All Attributes
    
                                 Identifier = 3
                                      State = Outstanding
                             Problem Status = Not-Handled
                      Clearance Report Flag = False
                            Escalated Alarm = False
                         Creation Timestamp = Thu, Jan 16, 2014 09:41:59 PM
                             Managed Object = NETACT server:.NETACT51 BSC 938189 BCF 61
                            Target Entities = { NETACT server:.NETACT51 BSC 938189 BCF 61 }
                                 Alarm Type = EnvironmentalAlarm
                                 Event Time = Thu, Jan 16, 2014 09:37:58 PM
                             Probable Cause = Indeterminate
                          Specific Problems = { 7405 }
                    Notification Identifier = 1757596347
                                     Domain = Domain server:.netact51_dom
                               Alarm Origin = IncomingAlarm
                         Perceived Severity = Major
                            Additional Text = "NUSS FAILURE, RECTIFIER_1 ALARM
                                              #S#10497405      **                                        ZONA TECNICA CENTR
                                              Pelluhue Playa
                                              PLMN-PLMN/BSC-938189/BCF-61
    
                                              SC_logical_name:9679;"
                          Original Severity = Major
                        Original Event Time = Thu, Jan 16, 2014 09:37:58 PM
                                Outage Flag = False
                        Problem Occurrences = 1 Problems
                   GPP3 Problem Occurrences = 0 Problems
               Critical Problem Occurrences = 0 Problems
                  Major Problem Occurrences = 1 Problems
                  Minor Problem Occurrences = 0 Problems
                Warning Problem Occurrences = 0 Problems
          Indeterminate Problem Occurrences = 0 Problems
                  Clear Problem Occurrences = 0 Problems
                                   SA Total = 0 Alarms
                                     Comuna = "PELLUHUE"
                                 CatCliente = "UNIC_SITE"
                                   Nemonico = "BTAL2_PYUEF6"
    

Thanks very much for any help you can give me!

3 answers:

Answer 0 (score: 2)

The following doesn't build on your script, but offers a line-by-line parsing approach:

use strict;
use warnings;

my ( $showHeader, $lastID, @header, @columns ) = ( 1, '' );

while (<>) {
    if ( my ( $identifier, $statement ) = /^\s+(\S[^=]+)\s+=\s+(.+)/ ) {

        if (    $identifier eq 'Managed Object'
            and $lastID ne 'Clearance Time Stamp' )
        {
            push @header, 'Clearance Time Stamp' if $showHeader;
            push @columns, '';
        }

        if ( $identifier eq 'Additional Text' ) {
            while (<>) {
                my ($additional) = /^\s+(\S.+)/ or next;
                $statement .= " $additional";    # keep a space between joined lines
                last if $additional =~ /SC_logical_name/;
            }
            $statement =~ s/\s+/ /g;
        }

        push @header, $identifier if $showHeader;
        push @columns, $statement;

        if ( $identifier eq 'Nemonico' ) {
            if ($showHeader) {
                print +( join ',', @header ), "\n";
                $showHeader = 0;
            }

            print +( join ',', map { $_ = qq/"$_"/ if /,/ and !/^"/; $_ } @columns ), "\n";
            undef @columns;
        }
        $lastID = $identifier;
    }
}

Usage: perl script.pl inFile [>outFile.csv]

The last, optional argument redirects the output to a file.

Runs of whitespace in the Additional Text field are replaced by a single space.
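On the quoting in the final print: the map in the script only wraps a field in quotes when it contains a comma and does not already start with a quote. If values may also contain double quotes or line breaks, a slightly fuller helper along these lines might be safer. This is only a sketch: csv_field is a made-up name, and it assumes the common CSV convention of doubling embedded quotes, and that a value wrapped in one pair of quotes (as in Comuna = "HUECHURABA") should lose that pair first.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the script above): turn one raw value
# into a CSV field.
sub csv_field {
    my ($value) = @_;
    $value = '' unless defined $value;

    # Assumption: one pair of surrounding quotes is decoration, so drop it.
    $value =~ s/\A"(.*)"\z/$1/s;

    # Quote only when needed, doubling any embedded double quotes.
    if ( $value =~ /[",\r\n]/ ) {
        $value =~ s/"/""/g;
        $value = qq{"$value"};
    }
    return $value;
}

# Prints: CAV,"Thu, Jan 16, 2014 10:21:17 PM",1
print join( ',', map { csv_field($_) }
        '"CAV"', 'Thu, Jan 16, 2014 10:21:17 PM', '1' ), "\n";
```

In the script it could replace the map block in the final print, for example join ',', map { csv_field($_) } @columns.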

Hope this helps!

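Answer 1 (score: 0)

Your Perl style is unusual and I find it tricky to read, but as far as I can tell you are treating the whole file as one long record. The header lines are ignored because they don't look like Identifier = Statement.

That means each hash element ends up set to the last value found for its identifier, which is generally the contents of the final record.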
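The overwriting can be seen in miniature here. This is an illustrative sketch, not the asker's script: the (header, value) pairs are made up, and flat_hash and per_record are hypothetical helper names. The first sub stores everything in one hash, so later values replace earlier ones. The second starts a new hash whenever an Identifier header appears, which keeps every record.

```perl
use strict;
use warnings;

# One flat hash: a later value for the same header replaces the earlier one.
sub flat_hash {
    my %columns;
    $columns{ $_->[0] } = $_->[1] for @_;
    return \%columns;
}

# One hash per record: a new record starts at each 'Identifier' header.
sub per_record {
    my @records;
    for my $pair (@_) {
        my ( $header, $value ) = @$pair;
        push @records, {} if $header eq 'Identifier' or !@records;
        $records[-1]{$header} = $value;
    }
    return \@records;
}

# Made-up pairs in file order, standing in for the regex matches.
my @pairs = (
    [ Identifier => 1 ], [ State => 'Outstanding' ],
    [ Identifier => 2 ], [ State => 'Cleared' ],
);

my $flat    = flat_hash(@pairs);
my $records = per_record(@pairs);

# Prints: flat hash: Identifier 2 only; per record: 2 records
printf "flat hash: Identifier %s only; per record: %d records\n",
    $flat->{Identifier}, scalar @$records;
```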

I think you would do better to rely less on regular expressions. The way you have it now is, as you have found, very hard to debug.

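Answer 2 (score: 0)

Your style is certainly unusual, but your script does work. The problem is that you overwrite each entry you find with the next one. You say you 'need to skip the header that separates each entry', but as far as I can tell your script already does that, so maybe I've misunderstood that point. In any case, these changes should fix your problem 2: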

my %columns;
my $current_entry = ''; # add this
while( $file =~
    m{ ((?&Identifier))[\t ]*=[ \t]*((?&Statement)) $grammar}xgc ) {

    # ...removed

    for($value) { chomp; s/\n/\\n/g }

    # add this check to separate each entry
    if ($header eq 'Identifier ') {
        $current_entry = $value;
    }
    $columns{$current_entry}{$header}=$value;
}

# the way the results are printed needs to change
# assumes there is always an entry with Identifier 1
# and that the first entry contains all possible headers

my $first = $columns{1};
my @headers = sort keys %$first;
print join "," => @headers; # Print column headers
print "\n";
for my $key (sort {$a <=> $b} keys %columns) {
    my $entry = $columns{$key};
    print join "," => map { $entry->{$_} } @headers; # Column content
    print "\n";
}
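
If some entries carry headers that the first one lacks (in the sample input only entry 2 has a Clearance Time Stamp line), the header list could instead be built from all entries. A minimal sketch, assuming the same nested %columns layout (entry id => { header => value }) as in the code above. csv_rows is a hypothetical name, the sample data is made up, and missing values are printed as empty fields:

```perl
use strict;
use warnings;

# Hypothetical helper: $entries maps an entry id to a hash of header => value.
# Returns the header line followed by one CSV line per entry.
sub csv_rows {
    my ($entries) = @_;

    # Collect every header that appears in any entry.
    my %seen;
    for my $entry ( values %$entries ) {
        $seen{$_} = 1 for keys %$entry;
    }
    my @headers = sort keys %seen;

    my @rows = ( join ',', @headers );
    for my $id ( sort { $a <=> $b } keys %$entries ) {
        # '//' (defined-or) needs Perl 5.10 or later.
        push @rows, join ',', map { $entries->{$id}{$_} // '' } @headers;
    }
    return @rows;
}

# Made-up sample data in the assumed layout.
my %columns = (
    1 => { 'Identifier' => 1, 'State' => 'Outstanding' },
    2 => {
        'Identifier'           => 2,
        'State'                => 'Outstanding',
        'Clearance Time Stamp' => 'Thu Jan 16 2014',
    },
);
print "$_\n" for csv_rows( \%columns );
```

This does not quote values that contain commas; that would still have to be handled as in the original script.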