New to Perl - parsing a file and replacing patterns with dynamic values

Date: 2014-01-08 00:57:54

Tags: perl shell parsing csv

I am new to Perl, and I am currently trying to convert a bash script to Perl.

My script converts nmon files (output of the AIX/Linux performance monitoring tool): for each nmon file present in a directory, it greps the specific section I need and redirects it to a temporary file, then greps the associated timestamps and redirects them to another file.

It then parses the data into a final csv file, which gets indexed by a third-party tool.

Sample NMON data looks like this:

TOP,%CPU Utilisation
TOP,+PID,Time,%CPU,%Usr,%Sys,Threads,Size,ResText,ResData,CharIO,%RAM,Paging,Command,WLMclass
TOP,5165226,T0002,10.93,9.98,0.95,1,54852,4232,51220,311014,0.755,1264,PatrolAgent,Unclassified
TOP,5365876,T0002,1.48,0.81,0.67,135,85032,132,84928,38165,1.159,0,db2sysc,Unclassified
TOP,5460056,T0002,0.32,0.27,0.05,1,5060,616,4704,1719,0.072,0,db2kmchan64.v9,Unclassified

The "Time" field (T0002 in the sample above; in NMON it is really called the "ZZZZ" timestamp) is a specific NMON timestamp reference. The actual value of this timestamp appears later in the NMON file, in a dedicated section, like this:

ZZZZ,T0001,00:09:55,01-JAN-2014
ZZZZ,T0002,00:13:55,01-JAN-2014
ZZZZ,T0003,00:17:55,01-JAN-2014
ZZZZ,T0004,00:21:55,01-JAN-2014
ZZZZ,T0005,00:25:55,01-JAN-2014

The NMON format is very specific and cannot be used directly without parsing: the timestamps have to be associated with their corresponding values. (An NMON file is almost a concatenation of many different csv files, each with its own format, its own fields, and so on.)
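
To make the relationship concrete: the Txxxx reference found in a data row is simply the key into the ZZZZ section, so conceptually the parsing boils down to building a lookup table and joining on it. With the sample values above, that table would look like this (written as a Perl hash purely for illustration):

# Illustration only, values taken from the ZZZZ sample above: a row carrying
# "T0002" must end up in the csv together with "01-JAN-2014 00:13:55".
my %timestamp_for = (
    'T0001' => '01-JAN-2014 00:09:55',
    'T0002' => '01-JAN-2014 00:13:55',
    'T0003' => '01-JAN-2014 00:17:55',
);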

I wrote the following bash script to parse the section I am interested in (the "TOP" section, which reports per-process cpu, mem and io statistics for each host):

#!/bin/bash

# set -x

################################################################
# INFORMATION
################################################################

# nmon2csv_TOP.sh

# Convert TOP section of nmon files to csv

# CAUTION: This script is expected to be launched by the main workflow
# $DST and DST_CONVERTED_TOP are being exported by it, if not this script will exit at launch time

################################################################
# VARS
################################################################

#  Location of NMON files
NMON_DIR=${DST}

# Location of generated files
OUTPUT_DIR=${DST_CONVERTED_TOP}

# Temp files
rawdatafile=/tmp/temp_rawdata.$$.temp
timestampfile=/tmp/temp_timestamp.$$.temp

# Main Output file
finalfile=${DST_CONVERTED_TOP}/NMON_TOP_processed_at_date_`date '+%F'`.csv

###########################
# BEGIN OF WORK
###########################

# Verify exported var are not null
if [ -z ${NMON_DIR} ]; then
    echo -e "\nERROR: Var NMON_DIR is null!\n" && exit 1
elif [ -z ${OUTPUT_DIR} ]; then
    echo -e "\nERROR: Var OUTPUT_DIR is null!\n" && exit 1
fi

# Check if temp and output files already exists
if [ -s ${rawdatafile} ]; then
    rm -f ${rawdatafile}

elif [ -s ${timestampfile} ]; then
    rm -f ${timestampfile}

elif [ -s ${finalfile} ]; then
    rm -f ${finalfile}

fi

# Get current location
PWD=`pwd`

# Go to NMON files location
cd ${NMON_DIR}

# For each NMON file present:

# To restrict to only PROD env: `ls *.nmon | grep -E -i 'sp|gp|ge'`
for NMON_FILE in `ls *.nmon | grep -E -i 'sp|gp|ge'`; do

# Set Hostname identification
serialnum=`grep 'AAA,SerialNumber,' ${NMON_FILE} | awk -F, '{print $3}' OFS=, | tr [:lower:] [:upper:]`
hostname=`grep 'AAA,host,' ${NMON_FILE} | awk -F, '{print $3}' OFS=, | tr [:lower:] [:upper:]`

# Grep and redirect TOP Section
grep 'TOP' ${NMON_FILE} | grep -v 'AAA,version,TOPAS-NMON' | grep -v 'TOP,%CPU Utilisation' > ${rawdatafile}

# Grep and redirect associated timestamps (ZZZZ)
grep 'ZZZZ' ${NMON_FILE}> ${timestampfile}

# Begin of work

while IFS=, read TOP PID Time Pct_CPU Pct_Usr Pct_Sys Threads Size ResText ResData CharIO Pct_RAM Paging Command WLMclass
    do

        timestamp=`grep ${Time} ${timestampfile} | awk -F, '{print $4 " "$3}' OFS=,`
        echo ${serialnum},${hostname},${timestamp},${Time},${PID},${Pct_CPU},${Pct_Usr},${Pct_Sys},${Threads},${Size},${ResText},${ResData},${CharIO},${Pct_RAM},${Paging},${Command},${WLMclass} \
        | grep -v '+PID,%CPU,%Usr,%Sys,Threads,Size,ResText,ResData,CharIO,%RAM,Paging,Command,WLMclass' >> ${finalfile}

    done < ${rawdatafile}

    echo -e "INFO: Done for Serialnum: ${serialnum} Hostname: ${hostname}"

done

# Go back to initial location
cd ${PWD}


###########################
# END OF WORK
###########################

This works as-is and generates one main csv file (you will see in the code that I deliberately do not keep the csv header in the file), which is the concatenation of all parsed hosts.

However, I have a very large number of hosts to process every day (about 3,000 hosts), and with the current code it can take, in the worst case, several minutes to generate the data for a single host. Multiply that by the number of hosts and the minutes add up very quickly...

So this code is really not efficient enough to handle that much data (most of the time is probably spent in the while loop, where every single data row triggers a separate grep over the timestamp file).

10 hosts represent roughly 200,000 lines, which end up as a csv file of about 20 MB. That is not huge, but I think a shell script is probably not the best tool to manage such a process...

I guess Perl would be much better suited to this task (even if the shell script could certainly be improved), but my knowledge of Perl is (for now) very limited, which is why I am asking for your help... I believe this should be fairly simple to do in Perl, but I cannot get it to work for now...

Someone once developed a Perl script to manage NMON files and convert them into sql files (to dump the data into a database). I adapted it to use its features, and with the help of some shell scripts I work on the resulting sql files to get my final csv files.

However, the TOP section was never integrated into that Perl script, so it cannot be used for this without rework.

The code in question:

#!/usr/bin/perl
# Program name: nmon2mysql.pl
# Purpose - convert nmon.csv file(s) into mysql insert file
# Author - Bruce Spencer
# Disclaimer:  this provided "as is".  
# Date - March 2007
#
$nmon2mysql_ver="1.0. March 2007";

use Time::Local;


#################################################
##  Your Customizations Go Here            ##
#################################################

#  Source directory for nmon csv files
my $NMON_DIR=$ENV{DST_TMP};
my $OUTPUT_DIR=$ENV{DST_CONVERTED_CPU_ALL};


# End "Your Customizations Go Here".  
# You're on your own, if you change anything beyond this line :-)

####################################################################
#############       Main Program            ############
####################################################################

# Initialize common variables
&initialize;

# Process all "nmon" files located in the $NMON_DIR
# @nmon_files=`ls $NMON_DIR/*.nmon $NMON_DIR/*.csv`;
@nmon_files=`ls $NMON_DIR/*.nmon`;

if (@nmon_files eq 0 ) { die ("No \*.nmon or csv files found in $NMON_DIR\n"); }

@nmon_files=sort(@nmon_files);
chomp(@nmon_files);

foreach $FILENAME ( @nmon_files ) {

  @cols= split(/\//,$FILENAME);
  $BASEFILENAME= $cols[@cols-1];

  unless (open(INSERT, ">$OUTPUT_DIR/$BASEFILENAME.sql")) { 
    die("Can not open /$OUTPUT_DIR/$BASEFILENAME.sql\n"); 
  }
  print INSERT ("# nmon version: $NMONVER\n");
  print INSERT ("# AIX version: $AIXVER\n");
  print INSERT ("use nmon;\n");

  $start=time();
  @now=localtime($start);
  $now=join(":",@now[2,1,0]);
  print ("$now: Begin processing file = $FILENAME\n");

  # Parse nmon file, skip if unsuccessful
  if (( &get_nmon_data ) gt 0 ) { next; }
  $now=time();
  $now=$now-$start;
  print ("\t$now: Finished get_nmon_data\n");


  # Static variables (number of fields always the same)
  #@static_vars=("LPAR","CPU_ALL","FILE","MEM","PAGE","MEMNEW","MEMUSE","PROC");
  #@static_vars=("LPAR","CPU_ALL","FILE","MEM","PAGE","MEMNEW","MEMUSE");

  @static_vars=("CPU_ALL");

  foreach $key (@static_vars) {
     &mk_mysql_insert_static($key);;
     $now=time();
     $now=$now-$start;
     print ("\t$now: Finished $key\n");
  } # end foreach



  # Dynamic variables (variable number of fields)
  #@dynamic_vars=("DISKBSIZE","DISKBUSY","DISKREAD","DISKWRITE","DISKXFER","ESSREAD","ESSWRITE","ESSXFER","IOADAPT","NETERROR","NET","NETPACKET");

  @dynamic_vars=("");

  foreach $key (@dynamic_vars) {
    &mk_mysql_insert_variable($key);;
    $now=time();
    $now=$now-$start;
    print ("\t$now: Finished $key\n");
  }

  close(INSERT);
#  system("gzip","$FILENAME");

}
exit(0);


############################################
#############  Subroutines  ############
############################################

##################################################################
## Extract CPU_ALL data for Static fields
##################################################################
sub mk_mysql_insert_static {

my($nmon_var)=@_; 
my $table=lc($nmon_var);

my @rawdata;
my $x;
my @cols;
my $comma;
my $TS;
my $n;


  @rawdata=grep(/^$nmon_var,/, @nmon);

  if (@rawdata < 1) { return(1); }

  @rawdata=sort(@rawdata);

  @cols=split(/,/,$rawdata[0]);
  $x=join(",",@cols[2..@cols-1]);
  $x=~ s/\%/_PCT/g;
  $x=~ s/\(MB\)/_MB/g;
  $x=~ s/-/_/g;
  $x=~ s/ /_/g;
  $x=~ s/__/_/g;
  $x=~ s/,_/,/g;
  $x=~ s/_,/,/g;
  $x=~ s/^_//;
  $x=~ s/_$//;

  print INSERT (qq|insert into $table (serialnum,hostname,mode,nmonver,time,ZZZZ,$x) values\n| );

  $comma="";
  $n=@cols;
  $n=$n-1; # number of columns -1 

  for($i=1;$i<@rawdata;$i++){ 

    $TS=$UTC_START + $INTERVAL*($i);

    @cols=split(/,/,$rawdata[$i]);
    $x=join(",",@cols[2..$n]);
    $x=~ s/,,/,-1,/g; # replace missing data ",," with a ",-1,"

    print INSERT (qq|$comma("$SN","$HOSTNAME","$MODE","$NMONVER",$TS,"$DATETIME{@cols[1]}",$x)| );

    $comma=",\n";
  }
  print INSERT (qq|;\n\n|);

} # end mk_mysql_insert

##################################################################
## Extract CPU_ALL data for variable fields
##################################################################
sub mk_mysql_insert_variable {

my($nmon_var)=@_; 
my $table=lc($nmon_var);

my @rawdata;
my $x;
my $j;
my @cols;
my $comma;
my $TS;
my $n;
my @devices;


  @rawdata=grep(/^$nmon_var,/, @nmon);

  if ( @rawdata < 1) { return; }

  @rawdata=sort(@rawdata);

  $rawdata[0]=~ s/\%/_PCT/g;
  $rawdata[0]=~ s/\(/_/g;
  $rawdata[0]=~ s/\)/_/g;
  $rawdata[0]=~ s/ /_/g;
  $rawdata[0]=~ s/__/_/g;
  $rawdata[0]=~ s/,_/,/g;

  @devices=split(/,/,$rawdata[0]);
  print INSERT (qq|insert into $table (serialnum,hostname,time,ZZZZ,device,value) values\n| );

  $n=@rawdata;
  $n--; 
  for($i=1;$i<@rawdata;$i++){ 

    $TS=$UTC_START + $INTERVAL*($i);
    $rawdata[$i]=~ s/,$//;
    @cols=split(/,/,$rawdata[$i]);

      print INSERT (qq|\n("$SN","$HOSTNAME",$TS,"$DATETIME{$cols[1]}","$devices[2]",$cols[2])| );
    for($j=3;$j<@cols;$j++){
      print INSERT (qq|,\n("$SN","$HOSTNAME",$TS,"$DATETIME{$cols[1]}","$devices[$j]",$cols[$j])| );
    }
    if ($i < $n) { print INSERT (","); } 
  }
  print INSERT (qq|;\n\n|);

} # end mk_mysql_insert_variable

########################################################
### Get an nmon setting from csv file            ###
### finds first occurance of $search             ###
### Return the selected column...$return_col     ###
### Syntax:                                      ###
###     get_setting($search,$col_to_return,$separator)##
########################################################

sub get_setting {

my $i;
my $value="-1";
my ($search,$col,$separator)= @_;    # search text, $col, $separator

for ($i=0; $i<@nmon; $i++){

  if ($nmon[$i] =~ /$search/ ) {
    $value=(split(/$separator/,$nmon[$i]))[$col];
    $value =~ s/["']*//g;  #remove non alphanum characters
    return($value);
    } # end if

  } # end for

return($value);

} # end get_setting

#####################
##  Clean up       ##
#####################
sub clean_up_line {

    # remove characters not compatible with nmon variable
    # Max rrdtool variable length is 19 chars
    # Variable can not contain special characters (% - () )
    my ($x)=@_; 

    # print ("clean_up, before: $i\t$nmon[$i]\n");
    $x =~ s/\%/Pct/g;
    # $x =~ s/\W*//g;
    $x =~ s/\/s/ps/g;       # /s  - ps
    $x =~ s/\//s/g;     # / - s
    $x =~ s/\(/_/g;
    $x =~ s/\)/_/g;
    $x =~ s/ /_/g;
    $x =~ s/-/_/g;
    $x =~ s/_KBps//g;
    $x =~ s/_tps//g;
    $x =~ s/[:,]*\s*$//;
    $retval=$x; 

} # end clean up


##########################################
##  Extract headings from nmon csv file ##
##########################################
sub initialize {

%MONTH2NUMBER =  ("jan", 1, "feb",2, "mar",3, "apr",4, "may",5, "jun",6, "jul",7, "aug",8, "sep",9, "oct",10, "nov",11, "dec",12 );

@MONTH2ALPHA =  (   "junk","jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec" );

} # end initialize

# Get data from nmon file, extract specific data fields (hostname, date, ...)
sub get_nmon_data {

my $key;
my $x;
my $category;
my %toc;
my @cols;

# Read nmon file
unless (open(FILE, $FILENAME)) { return(1); }
@nmon=<FILE>;  # input entire file
close(FILE);
chomp(@nmon);

# Cleanup nmon data remove trainig commas and colons
for($i=0; $i<@nmon;$i++ ) {
    $nmon[$i] =~ s/[:,]*\s*$//;
}

# Get nmon/server settings (search string, return column, delimiter)
$AIXVER     =&get_setting("AIX",2,",");
$DATE       =&get_setting("date",2,",");
$HOSTNAME   =&get_setting("host",2,",");
$INTERVAL   =&get_setting("interval",2,","); # nmon sampling interval

$MEMORY     =&get_setting(qq|lsconf,"Good Memory Size:|,1,":");
$MODEL      =&get_setting("modelname",3,'\s+');
$NMONVER    =&get_setting("version",2,",");

$SNAPSHOTS  =&get_setting("snapshots",2,",");  # number of readings

$STARTTIME  =&get_setting("AAA,time",2,",");
($HR, $MIN)=split(/\:/,$STARTTIME);


if ($AIXVER eq "-1") {
    $SN=$HOSTNAME;  # Probably a Linux host
} else {
    $SN =&get_setting("systemid",4,",");
    $SN     =(split(/\s+/,$SN))[0]; # "systemid IBM,SN ..."
}

$TYPE       =&get_setting("^BBBP.*Type",3,",");
if ( $TYPE =~ /Shared/ ) { $TYPE="SPLPAR"; } else { $TYPE="Dedicated"; }

$MODE       =&get_setting("^BBBP.*Mode",3,",");
$MODE       =(split(/: /, $MODE))[1];
# $MODE     =~s/\"//g;


# Calculate UTC time (seconds since 1970)
# NMON V9  dd/mm/yy
# NMON V10+ dd-MMM-yyyy

if ( $DATE =~ /[a-zA-Z]/ ) {   # Alpha = assume dd-MMM-yyyy date format
    ($DAY, $MMM, $YR)=split(/\-/,$DATE);
    $MMM=lc($MMM);
    $MON=$MONTH2NUMBER{$MMM};
} else {
    ($DAY, $MON, $YR)=split(/\//,$DATE);
    $YR=$YR + 2000;
    $MMM=$MONTH2ALPHA[$MON];
} # end if

## Calculate UTC time (seconds since 1970).  Required format for the rrdtool.

##  timelocal format
##    day=1-31
##    month=0-11
##    year = x -1900  (time since 1900) (seems to work with either 2006 or 106)

$m=$MON - 1;  # jan=0, feb=2, ...

$UTC_START=timelocal(0,$MIN,$HR,$DAY,$m,$YR); 
$UTC_END=$UTC_START + $INTERVAL * $SNAPSHOTS;

@ZZZZ=grep(/^ZZZZ,/,@nmon);
for ($i=0;$i<@ZZZZ;$i++){

    @cols=split(/,/,$ZZZZ[$i]);
    ($DAY,$MON,$YR)=split(/-/,$cols[3]);
    $MON=lc($MON);
    $MON="00" . $MONTH2NUMBER{$MON};
    $MON=substr($MON,-2,2);
    $ZZZZ[$i]="$YR-$MON-$DAY $cols[2]";
    $DATETIME{$cols[1]}="$YR-$MON-$DAY $cols[2]";


} # end ZZZZ

return(0);
} # end get_nmon_data

It does the job almost perfectly (I say almost because with recent NMON versions it sometimes has issues when no data is present), and for the sections it does handle it runs much faster than my shell script.

That is why I think Perl should be a perfect fit for this task.

Of course, I am not asking anyone to convert my shell script into a final Perl version for me, but at least to point me in the right direction :-)

I would really appreciate any help!

1 Answer:

Answer 0 (score: 0)

Normally I would strongly advise against questions like this one, but our production systems are down until they get fixed and I really do not have that much else to do right now...

Here is some code that might get you started. Please consider it pseudocode, as it is completely untested and may not even compile (I always forget some parentheses or semicolons and, as I said, the machine where I could actually run code is not reachable), but I have commented it heavily and hopefully you will be able to adapt it to your actual needs and get it running.

use strict;
use warnings;

open INFILE, "<", "path/to/file.nmon"       # Open the file.
    or die "Cannot open input file: $!";

my @topLines;                               # Initialize variables.
my %timestamps;

while (<INFILE>)                            # This will walk over all the lines of the infile.
{                                           # Storing the current line in $_.
    chomp $_;                               # Remove newline at the end.
    if ($_ =~ m/^TOP,\d/)                   # If the line starts with TOP followed by a PID (skips the TOP header lines)...
    {
        push @topLines, $_;                 # ...store it in the array for later use.
    }
    elsif ($_ =~ m/^ZZZZ/)                  # If it is in the ZZZZ section...
    {
        my @fields = split ',', $_;         # ...split the line at commas...
        my $timestamp = join ",", $fields[2], $fields[3];   # ...join the timestamp into a string as you wish...
        $timestamps{$fields[1]} = $timestamp;               # ...and store it in the hash with the Twhatever thing as key.
    }

# This iteration could certainly be improved with more knowledge
# of how the file looks. For example the search could be cancelled
# after the ZZZZ section if the file is still long.
}

close INFILE;

open OUTFILE, ">", "path/to/output.csv"     # Open the file you want your output in.
    or die "Cannot open output file: $!";

foreach (@topLines)                         # Iterate through all elements of the array.
{                                           # Once again storing the current value in $_.
    my @fields = split ',', $_;             # Probably not necessary, depending on how output should be formatted.
    my $outstring = join ',', $fields[0], $fields[1], $timestamps{$fields[2]};  # And whatever other fields you care for.
    print OUTFILE $outstring, "\n";         # Print.
}
close OUTFILE;
print "Done.\n";