I'm trying to parse a txt file that has a specific format and convert it to a CSV file, but I've run into two problems:
My code:
my $grammar = qr!
    (?(DEFINE)
        (?<Identifier> [^=\n]+ )
        (?<Statement>
            (?:            # Begin alternation
                "          # Opening quotes
                [^"]+?     # Any non-quotes (including a new line)
                "          # Closing quotes
              | [^\n]+     # Or a single line
            )              # End alternation
        )
    )
!x;
my $file = do { local $/; <> };    # Slurp file named on command line
my %columns;
while ( $file =~
    m{ ((?&Identifier))[\t ]*=[ \t]*((?&Statement)) $grammar}xgc )
{
    my ( $header, $value ) = ( $1, $2 );
    # Remove leading spaces and quote variable if it contains commas:
    for ( $header, $value ) { s/^\s+//mg; /,/ and s/^|$/"/g }
    # Substitute \n with \\n to make multi-line values single-line:
    for ($value) { chomp; s/\n/\\n/g }
    $columns{$header} = $value;
}
print join "," => sort keys %columns;    # Print column headers
print "\n";
print join "," => map { $columns{$_} } sort keys %columns;    # Column content
print "\n";
The input file looks like this:
OPERATION_CONTEXT server:.oc_name alarm_object 1
On director: server:.temip.prd1149_director
AT Thu, Jan 16, 2014 10:33:44 PM All Attributes
Identifier = 1
State = Outstanding
Problem Status = Not-Handled
Clearance Report Flag = False
Escalated Alarm = False
Creation Timestamp = Thu, Jan 16, 2014 10:21:17 PM
Managed Object = NETACT server:.NETACT51 BSC 716499 BCF 123
Target Entities = { NETACT server:.NETACT51 BSC 716499 BCF 123 }
Alarm Type = EnvironmentalAlarm
Event Time = Thu, Jan 16, 2014 10:17:14 PM
Probable Cause = Indeterminate
Specific Problems = { 7409 }
Notification Identifier = 2433009629
Domain = Domain server:.netact51_dom
Alarm Origin = IncomingAlarm
Perceived Severity = Critical
Additional Text = "ALARMA CRITICA SISTEMA DAS 1900
#S#10497409 *** ZONA TECNICA SANTI
PLMN-PLMN/BSC-716499/BCF-123
SC_logical_name:9344;"
Original Severity = Critical
Original Event Time = Thu, Jan 16, 2014 10:17:14 PM
Outage Flag = False
Problem Occurrences = 1 Problems
GPP3 Problem Occurrences = 0 Problems
Critical Problem Occurrences = 1 Problems
Major Problem Occurrences = 0 Problems
Minor Problem Occurrences = 0 Problems
Warning Problem Occurrences = 0 Problems
Indeterminate Problem Occurrences = 0 Problems
Clear Problem Occurrences = 0 Problems
SA Total = 0 Alarms
Comuna = "HUECHURABA"
CatCliente = "CAV"
Nemonico = "BSMT6_PZANF3"
OPERATION_CONTEXT server:.oc_name alarm_object 2
On director: server:.temip.prd1149_director
AT Thu, Jan 16, 2014 10:33:44 PM All Attributes
Identifier = 2
State = Outstanding
Problem Status = Not-Handled
Clearance Report Flag = False
Escalated Alarm = False
Creation Timestamp = Thu, Jan 16, 2014 10:14:03 PM
Clearance Time Stamp = Thu, Jan 16, 2014 10:29:08 PM
Managed Object = NETACT server:.NETACT51 BSC 206259 BCF 103
Target Entities = { NETACT server:.NETACT51 BSC 206259 BCF 103 }
Alarm Type = EnvironmentalAlarm
Event Time = Thu, Jan 16, 2014 10:29:37 PM
Probable Cause = Indeterminate
Specific Problems = { 7409 }
Notification Identifier = 3780327614
Domain = Domain server:.netact51_dom
Alarm Origin = IncomingAlarm
Perceived Severity = Critical
Additional Text = "ALARMA CRITICA SISTEMA DAS 1900
#S#10497409 *** ZONA TECNICA CENTR
Merval BSC VLP7
PLMN-PLMN/BSC-206259/BCF-103
ALARMA CRITICA SISTEMA DAS 1900
SC_logical_name:94681;"
Original Severity = Critical
Original Event Time = Thu, Jan 16, 2014 10:10:01 PM
Outage Flag = False
Problem Occurrences = 4 Problems
GPP3 Problem Occurrences = 0 Problems
Critical Problem Occurrences = 4 Problems
Major Problem Occurrences = 0 Problems
Minor Problem Occurrences = 0 Problems
Warning Problem Occurrences = 0 Problems
Indeterminate Problem Occurrences = 0 Problems
Clear Problem Occurrences = 3 Problems
SA Total = 6 Alarms
Comuna = "VINA DEL MAR"
CatCliente = "CAV"
Nemonico = "BVLP7_MVALF9"
OPERATION_CONTEXT server:.oc_name alarm_object 3
On director: server:.temip.prd1149_director
AT Thu, Jan 16, 2014 10:33:45 PM All Attributes
Identifier = 3
State = Outstanding
Problem Status = Not-Handled
Clearance Report Flag = False
Escalated Alarm = False
Creation Timestamp = Thu, Jan 16, 2014 09:41:59 PM
Managed Object = NETACT server:.NETACT51 BSC 938189 BCF 61
Target Entities = { NETACT server:.NETACT51 BSC 938189 BCF 61 }
Alarm Type = EnvironmentalAlarm
Event Time = Thu, Jan 16, 2014 09:37:58 PM
Probable Cause = Indeterminate
Specific Problems = { 7405 }
Notification Identifier = 1757596347
Domain = Domain server:.netact51_dom
Alarm Origin = IncomingAlarm
Perceived Severity = Major
Additional Text = "NUSS FAILURE, RECTIFIER_1 ALARM
#S#10497405 ** ZONA TECNICA CENTR
Pelluhue Playa
PLMN-PLMN/BSC-938189/BCF-61
SC_logical_name:9679;"
Original Severity = Major
Original Event Time = Thu, Jan 16, 2014 09:37:58 PM
Outage Flag = False
Problem Occurrences = 1 Problems
GPP3 Problem Occurrences = 0 Problems
Critical Problem Occurrences = 0 Problems
Major Problem Occurrences = 1 Problems
Minor Problem Occurrences = 0 Problems
Warning Problem Occurrences = 0 Problems
Indeterminate Problem Occurrences = 0 Problems
Clear Problem Occurrences = 0 Problems
SA Total = 0 Alarms
Comuna = "PELLUHUE"
CatCliente = "UNIC_SITE"
Nemonico = "BTAL2_PYUEF6"
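Thank you very much for any help you can give me!
Answer 0 (score: 2)
The following doesn't build on your script, but offers a line-by-line parsing approach: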
use strict;
use warnings;
my ( $showHeader, $lastID, @header, @columns ) = ( 1, '' );
while (<>) {
    # Attribute lines: leading whitespace (required), then "Name = value"
    if ( my ( $identifier, $statement ) = /^\s+(\S[^=]+)\s+=\s+(.+)/ ) {
        # Pad the optional 'Clearance Time Stamp' column when a record lacks it
        if (    $identifier eq 'Managed Object'
            and $lastID ne 'Clearance Time Stamp' )
        {
            push @header, 'Clearance Time Stamp' if $showHeader;
            push @columns, '';
        }
        # Append the continuation lines of the multi-line value
        if ( $identifier eq 'Additional Text' ) {
            while (<>) {
                my ($additional) = /^\s+(\S.+)/ or next;
                $statement .= $additional;
                last if $additional =~ /SC_logical_name/;
            }
            $statement =~ s/\s+/ /g;
        }
        push @header,  $identifier if $showHeader;
        push @columns, $statement;
        # 'Nemonico' is the last attribute of a record: print the row
        if ( $identifier eq 'Nemonico' ) {
            if ($showHeader) {
                print +( join ',', @header ), "\n";
                $showHeader = 0;
            }
            # Quote fields that contain commas and are not already quoted
            print +( join ',', map { $_ = qq/"$_"/ if /,/ and !/^"/; $_ } @columns ), "\n";
            undef @columns;
        }
        $lastID = $identifier;
    }
}
Usage: perl script.pl inFile [>outFile.csv]
The last, optional parameter directs the output to a file.
Multiple spaces are replaced by a single space in the Additional Text field.
Hope this helps!
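Answer 1 (score: 0)
Your Perl style is unusual and I find it tricky to read, but you are treating the whole file as one long record. The header lines are ignored because they don't look like Identifier = Statement.
This means each of your hash elements ends up set to the *last* value found for its identifier, which is usually the content of the final record.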
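To make that concrete, here is a minimal sketch of the overwrite using a made-up two-record string instead of your file. The sample text and the simplified pattern are my own assumptions, not your grammar:

```perl
use strict;
use warnings;

# Hypothetical, shortened input: two records that share attribute names.
my $text = "Identifier = 1\nState = Outstanding\nIdentifier = 2\nState = Cleared\n";

my %columns;
while ( $text =~ /^([^=\n]+?)\s*=\s*([^\n]+)$/mg ) {
    # Same key as in the previous record, so the earlier value is replaced.
    $columns{$1} = $2;
}

# Only the values from the second record are left in the hash.
print "$_=$columns{$_}\n" for sort keys %columns;
```

Because a flat hash keyed only by attribute name has one slot per name, every record after the first replaces what was stored before it.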
I believe you would be better off relying less on regular expressions. The way you have it now is (as you've found) very hard to debug.
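As a rough sketch of what I mean, you could first cut the text into one chunk per alarm and then parse each chunk with a much smaller pattern. The helper names `split_records` and `parse_record` are hypothetical, and this version deliberately ignores the continuation lines of multi-line quoted values such as Additional Text:

```perl
use strict;
use warnings;

# Hypothetical helper: one chunk per alarm, using the OPERATION_CONTEXT
# line as the record marker. The lookahead keeps the marker in the chunk.
sub split_records {
    my ($text) = @_;
    return grep { /\S/ } split /^(?=OPERATION_CONTEXT\b)/m, $text;
}

# Hypothetical helper: collect the single-line "Name = value" pairs of one
# record. Lines without "=" (headers, continuation lines) are skipped.
sub parse_record {
    my ($record) = @_;
    my %field;
    for my $line ( split /\n/, $record ) {
        my ( $name, $value ) = $line =~ /^\s*([^=]+?)\s*=\s*(.*?)\s*$/ or next;
        $field{$name} = $value;
    }
    return \%field;
}

# Shortened sample in the shape of the question's input.
my $sample = join "\n",
    'OPERATION_CONTEXT server:.oc_name alarm_object 1',
    'Identifier = 1',
    'State = Outstanding',
    'OPERATION_CONTEXT server:.oc_name alarm_object 2',
    'Identifier = 2',
    'State = Outstanding',
    '';

my @records = map { parse_record($_) } split_records($sample);
print scalar(@records), " records\n";
print "$_->{Identifier}: $_->{State}\n" for @records;
```

Each record gets its own hash, so nothing is overwritten, and a failure can be traced to one chunk and one line instead of one large pattern.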
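Answer 2 (score: 0)
Your style is certainly unusual, but your script does work. The problem is that you overwrite each entry you find with the next one. You said you need to skip the headers that separate each entry, but as far as I can tell your script already does that, so maybe I've misunderstood that point. In any case, these changes should solve your problem 2: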
my %columns;
my $current_entry = '';    # add this
while ( $file =~
    m{ ((?&Identifier))[\t ]*=[ \t]*((?&Statement)) $grammar}xgc ) {
    # ...removed
    for ($value) { chomp; s/\n/\\n/g }
    # add this check to separate each entry
    if ( $header eq 'Identifier ' ) {
        $current_entry = $value;
    }
    $columns{$current_entry}{$header} = $value;
}
# need to change the way you print the results
# assumes there is always an Identifier: 1
# and that the first entry contains all possible headers
my $first   = $columns{1};
my @headers = sort keys %$first;
print join "," => @headers;    # Print column headers
print "\n";
for my $key ( sort { $a <=> $b } keys %columns ) {
    my $entry = $columns{$key};
    print join "," => map { $entry->{$_} } @headers;    # Column content
    print "\n";
}
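If the first entry does not contain every header (in the sample input only the second alarm has a Clearance Time Stamp), one possible variation is to build the header list from all entries. This is only a sketch on made-up data shaped like %columns above; the trimmed header names and values are assumptions for illustration:

```perl
use strict;
use warnings;

# Hypothetical data: entry id => { header => value }.
# Entry 2 carries a header that entry 1 lacks.
my %columns = (
    1 => { 'Identifier' => 1, 'State' => 'Outstanding' },
    2 => {
        'Identifier'           => 2,
        'State'                => 'Outstanding',
        'Clearance Time Stamp' => '"Thu, Jan 16, 2014 10:29:08 PM"',
    },
);

# Collect the union of headers across all entries.
my %seen;
$seen{$_} = 1 for map { keys %$_ } values %columns;
my @headers = sort keys %seen;

my @lines = ( join ',', @headers );
for my $key ( sort { $a <=> $b } keys %columns ) {
    my $entry = $columns{$key};
    # A missing attribute becomes an empty field rather than undef.
    push @lines, join ',', map { $entry->{$_} // '' } @headers;
}
print "$_\n" for @lines;
```

The `//` operator needs Perl 5.10 or later, which the named-recursion grammar in the question requires anyway.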