Can this parsing script be made faster? Are there bottlenecks?

Asked: 2013-12-30 11:41:40

Tags: perl

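I have a directory containing about 10,000 files, roughly 1 GB in total. Parsing one file (including the inserts into MySQL) takes about 2 minutes. At that rate it would take around 13 days to parse all of them. Something must be wrong, because that is not normal. I have 8 GB of RAM and the machine's specs are decent.

Here is my code: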

#!/usr/bin/perl -w

use strict;
use warnings;
use DBI; # Load the DBI module for connection to mysql database (or others)
use Time::localtime; # Load Time::localtime module used to convert dates and times
use Net::IP;
use Getopt::Long; # Provides GetOptions, which databaseloader calls
my @datasetarray;
my $Dir = "/root/updates\/processed";

opendir my($dh), $Dir or die "Could not open directory [$Dir]: $!";

foreach my $file ( sort { $a cmp $b } readdir $dh )
    {
    next if $file eq "." or $file eq "..";

    print "$file\n";
    unless ($file=~/\.hr$/){next;}


    my $path = "$Dir/$file";   # full path; avoids shadowing the loop variable

    open (IN, '<', $path) or die "error reading file: $path: $!\n";


    my $record_id = "";
        my $time="";
    my $type = "";
    my $peer_ip = "";
    my $peer_as = "";
    my $local_ip = "";
    my $local_as = "";
    my $next_hop = "";
    my @nodes_and_index = ();
    my @withdraw_prefix = ();
    my @announce_prefix = ();
    my $tmphash = {};

    while (<IN>) {          
        no warnings 'uninitialized';

        if (/^TIME/) {

            if ($type) {$tmphash->{'type'} = $type;}
            if ($peer_ip) {$tmphash->{'peer_ip'} = $peer_ip;}
            if ($peer_as) {$tmphash->{'peer_as'} = $peer_as;}
            if ($local_ip) {$tmphash->{'local_ip'} = $local_ip;}
            if ($local_as) {$tmphash->{'local_as'} = $local_as;}
            if ($next_hop) {$tmphash->{'next_hop'} = $next_hop;}
            if (@nodes_and_index) {push @{$tmphash->{'nodes_and_index'}}, @nodes_and_index;}  
            if (@withdraw_prefix) {push @{$tmphash->{'withdraw_prefix'}}, @withdraw_prefix;}
            if (@announce_prefix) {push @{$tmphash->{'announce_prefix'}}, @announce_prefix;}


            if ($record_id) {  
                        $tmphash->{'time'} = $record_id;
                        push @datasetarray, $tmphash;
                        $tmphash = {};
                    }        

            $peer_as = "";
            $peer_ip = "";
            $type = "";
            $local_ip = "";
            $local_as = "";
            $next_hop = "";
            $record_id = "";
            $time="";
            @nodes_and_index = ();
            @withdraw_prefix = ();
            @announce_prefix = ();


            my @time = split '\s', $_;
            $record_id = $time[1]." ".$time[2];


        } elsif (/^TYPE/) {
            my @type_tmp = split '\s', $_;
            $type = $type_tmp[1];

        } elsif (/^FROM/) {
            my @from_tmp = split '\s', $_;
            $peer_ip = $from_tmp[1];
            $peer_as = $from_tmp[2];
            $peer_as =~ s/AS//; 

        } elsif (/^TO/) {
            my @to_tmp = split '\s', $_;
            $local_ip = $to_tmp[1];
            $local_as = $to_tmp[2];
            $local_as =~ s/AS//;

        } elsif (/^ASPATH/) {

            my @nodes_tmp = split '\s', $_;
                shift @nodes_tmp;       
            my $index = 0;

            foreach my $node (@nodes_tmp) {
                    $index++;
            push @nodes_and_index, [$node , $index]; 
             }  

        }elsif (/^NEXT_HOP/) {  

            my @next_hop_tmp = split '\s', $_;
            $next_hop = $next_hop_tmp[1];  

        }elsif (/^WITHDRAW/) {
            while (<IN>) {       
                     last if !/^ +/;  
                     push @withdraw_prefix, [$_] ;           

                 }

        }elsif (/^ANNOUNCE/) {

                 while (<IN>) {        
                     last if !/^ +/;
                     push @announce_prefix, [$_];

                 }  

            }


    }
    close IN;


    if ($type) {$tmphash->{'type'} = $type;} 
        if ($peer_ip) {$tmphash->{'peer_ip'} = $peer_ip;}
        if ($peer_as) {$tmphash->{'peer_as'} = $peer_as;} 
        if ($local_ip) {$tmphash->{'local_ip'} = $local_ip;} 
        if ($local_as) {$tmphash->{'local_as'} = $local_as;} 
        if ($next_hop) {$tmphash->{'next_hop'} = $next_hop;} 
        if (@nodes_and_index) {push @{$tmphash->{'nodes_and_index'}}, @nodes_and_index;}  
        if (@withdraw_prefix) {push @{$tmphash->{'withdraw_prefix'}}, @withdraw_prefix;}  
        if (@announce_prefix) {push @{$tmphash->{'announce_prefix'}}, @announce_prefix;}  

        if ($record_id) {  
             $tmphash->{'time'} = $record_id; 
             push @datasetarray, $tmphash;
            $tmphash = {};
        }  

 databaseloader(); #Call database loader subroutine;
 @datasetarray=();


}  




###########################################DATABASE INSERTS########################################################

sub databaseloader {

my $hostname;
my $username;
my $password;
my $database_name;
my $error_log;
if(@ARGV > 0) {
    GetOptions('h|hostname=s' => \$hostname,
           'u|user=s' => \$username,
           'p|password=s' => \$password,
           'db|database_name=s' => \$database_name,
           'e|error_log=s' => \$error_log
           );
}

#defaults
if(! defined $hostname) {
    $hostname = "localhost";
}
if(! defined $username) {
    $username = "root";
}
if(! defined $password) {    
    $password = "admin";
}
if(! defined $database_name) {    
    $database_name = "BGPstorage";
}
if(! defined $error_log) {
    `touch /tmp/error_log`;
    $error_log = "/tmp/error_log";
}

#print "making connection to database named $database_name on $hostname with user: $username and password: $password\n" ;

#connect to mysql database
my $dbh = DBI->connect( "dbi:mysql:$database_name:$hostname", $username, $password, {
      PrintError => 1,
      RaiseError => 0
  } ) or die "Can't connect to the database: $DBI::errstr\n";



                 ### Prepare SQL statements ###



### Update details table --> information about the UPDATE message (Update_ID,Time,Type,Peer_IP)
my $Update_detail= $dbh->prepare_cached("INSERT IGNORE INTO update_detail VALUES(NULL,?,?,?)" ) or die "Can't prepare SQL statement: $DBI::errstr\n";


### Announce updates table --> information about the announce UPDATE message (Announce_UpdateID,Prefix,Update_ID)
my $Announce_update= $dbh->prepare_cached("INSERT IGNORE INTO announce_update VALUES(NULL,?,?,?)" ) or die "Can't prepare SQL statement: $DBI::errstr\n";


### Withdraw updates table --> information about the withdraw UPDATE message (Withdraw_UpdateID,Prefix,Update_ID)
my $Withdraw_update= $dbh->prepare_cached("INSERT IGNORE INTO withdraw_update VALUES(NULL,?,?,?)" ) or die "Can't prepare SQL statement: $DBI::errstr\n";

### AS PATH table --> information about the Autonomous system PATHS (AS_Path_ID,Path_Index,AS_No,Update_ID) i.e. (001,1,2321| 002,1,322)
my $as_path = $dbh->prepare_cached("INSERT IGNORE INTO as_path VALUES(NULL,?,?,?)" ) or die "Can't prepare SQL statement: $DBI::errstr\n";



#Define Variables

my $TIME;
my $TYPE;
my $PEERAS;
my $PEERIP;
my $LOCALAS;
my $LOCALIP;
my $MYNEXTHOP;
my @WITHDRAWALS;
my @ANNOUNCED;
my $UpdateKey; #Get the last updated key value
foreach my $row (@datasetarray) {
    no warnings 'uninitialized';



    my $TIME = $row->{'time'} ;
    my $TYPE = $row->{'type'} ;
    my $PEERAS = $row->{'peer_as'};
    my $PEERIP = $row->{'peer_ip'};
    my $LOCALAS = $row->{'local_as'};
    my $LOCALIP = $row->{'local_ip'};
    my $MYNEXTHOP = $row->{'next_hop'};
    # Default to an empty list; "my ... if ..." has undefined behavior in Perl
    my @ASPATH = @{ $row->{'nodes_and_index'} || [] };
    my @WITHDRAWALS = @{ $row->{'withdraw_prefix'} || [] };
    my @ANNOUNCED = @{ $row->{'announce_prefix'} || [] };



    #INSERT INTO UPDATES TABLE
    my $mysql_dt = sprintf('20%3$s-%1$s-%2$s %4$s', split(/[\/ ]/, $TIME));
    $Update_detail->execute($mysql_dt,$TYPE,$PEERIP);  #Insert into Update_detail table
    $UpdateKey = $Update_detail->{mysql_insertid}; #Get primary key of last inserted statement

    #INSERT INTO AS PATH TABLE
    foreach my $as (@ASPATH) {  
            no warnings 'uninitialized';
            $as_path->execute($as->[1],$as->[0],$UpdateKey);

        } 

    #INSERT INTO ANNOUNCE TABLE
    foreach my $au (@ANNOUNCED) {
            no warnings 'uninitialized';
                        my $val=$au->[0];
            $val=~s/^\s+//;   #To remove leading whitespace in the IP           
            my $prefix = new Net::IP ($val) or die (Net::IP::Error());
            my $IP = $prefix->ip();
            my $subnetmask = $prefix->mask();
            $Announce_update->execute($IP,$subnetmask,$UpdateKey);

        } 

    #INSERT INTO WITHDRAW TABLE
    foreach my $wd (@WITHDRAWALS) {
            no warnings 'uninitialized';
            my $val=$wd->[0];
            $val=~s/^\s+//;   #To remove leading whitespace in the IP
            my $prefix = new Net::IP ($val) or die (Net::IP::Error());
            my $IP = $prefix->ip(); 
            my $subnetmask = $prefix->mask();
            $Withdraw_update->execute($IP,$subnetmask,$UpdateKey);
        } 




}

}
print "\nInsertion Completed\n";    

Example of the file text structure:

TIME: 07/27/13 09:00:00
TYPE: BGP4MP/MESSAGE/Update
FROM: 10.255.9.4 AS172193
TO: 10.255.9.10 AS676767
WITHDRAW
  10.27.236.0/24

TIME: 07/27/13 09:00:00
TYPE: BGP4MP/MESSAGE/Update
FROM: 10.255.9.4 AS172193
TO: 10.255.9.10 AS676767
ORIGIN: IGP
ASPATH: 172193 19601 14835 4758 15731 410 913 72 2113 7659 5024
NEXT_HOP: 10.255.9.126
ANNOUNCE
  10.27.236.0/24

TIME: 07/27/13 09:00:02
TYPE: BGP4MP/MESSAGE/Update
FROM: 10.255.9.4 AS172193
TO: 10.255.9.10 AS676767
ORIGIN: IGP
ASPATH: 172193 19601 14835 3352 3687 7196 14384 15037 9486 8580
NEXT_HOP: 10.255.9.126
ANNOUNCE
  10.2.86.0/24
  10.2.92.0/24

2 answers:

Answer 0 (score: 2)

Don't reconnect to the database on every loop iteration.

A simple way to do this is to declare a variable my $dbh = 0 at the top of the file, below the "use" statements, and then in the databaseloader sub change the connect statement from

my $dbh = DBI->connect( "dbi:mysql:$database_name:$hostname", $username, $password, {
      PrintError => 1,
      RaiseError => 0
  } ) or die "Can't connect to the database: $DBI::errstr\n";

to something like this:

if (not($dbh)) {
    $dbh = DBI->connect( "dbi:mysql:$database_name:$hostname", $username, $password, {
        PrintError => 1,
        RaiseError => 0
    } ) or die "Can't connect to the database: $DBI::errstr\n";
}

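Connecting only once will give you a huge performance gain. You probably want to move all the GetOptions handling and the defaults out of the sub as well, but caching the database handle is the big win.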
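
The same idea can be wrapped in a small helper instead of a file-level flag. What follows is only a minimal sketch: get_dbh is a hypothetical name, and $connect is a code reference standing in for the real DBI->connect call so that the example runs without a database.

```perl
use strict;
use warnings;

# Hypothetical helper: build the handle on first use, then reuse it.
# $connect is a code reference standing in for the real DBI->connect call,
# so this sketch runs without a database.
{
    my $dbh;    # cached handle, visible only to get_dbh

    sub get_dbh {
        my ($connect) = @_;
        $dbh ||= $connect->();
        return $dbh;
    }
}

my $connections = 0;
my $connect = sub { $connections++; return { name => 'fake handle' } };

# Simulate three files being loaded; only the first call connects.
get_dbh($connect) for 1 .. 3;
print "connections made: $connections\n";
```

In the real script the code reference would presumably wrap the existing DBI->connect(...) or die ... statement, and databaseloader would call get_dbh instead of connecting itself.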

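Answer 1 (score: 1)

Running a profiler such as Devel::NYTProf and looking at its output is the best way to determine this. Add -d:NYTProf to the first (shebang) line, run the script, and then run nytprofhtml; the report shows where the program spends most of its time. If you then find that it spends a lot of time connecting to the database or running some particular regex, you can narrow down where to focus your work.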
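
If installing a profiler is not an option, a coarser check is to time the parsing and loading phases by hand. This is not NYTProf, just a rough sketch using the core Time::HiRes module; the timed helper and the stand-in parse and load steps are hypothetical and only illustrate the pattern.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

my %elapsed;    # wall-clock seconds accumulated per phase label

# Hypothetical helper: run a code reference and add its run time to a phase.
sub timed {
    my ($phase, $code) = @_;
    my $start  = [gettimeofday];
    my @result = $code->();
    $elapsed{$phase} += tv_interval($start);    # seconds since $start
    return @result;
}

# Stand-ins for the real work: split a few lines, then "load" the records.
my @lines   = ("TIME: 07/27/13 09:00:00\n", "TYPE: BGP4MP/MESSAGE/Update\n");
my @records = timed(parse => sub { map { [ split ' ', $_ ] } @lines });
timed(load => sub { return scalar @records });

printf "%-6s %.6f s\n", $_, $elapsed{$_} for sort keys %elapsed;
```

Wrapping the per-file parsing loop and the databaseloader call this way would show which of the two dominates the two minutes per file, before reaching for a line-level profile.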