Perl: parsing a CSV file and "filling in" null fields

Date: 2016-06-08 17:42:38

Tags: perl parsing null

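OK, I'll post my entire script, because I get chastised when I don't, even though the last time I did so I was heavily criticized for posting the whole thing. I just need to know whether the line I originally asked about will work. The complete script, which worked fine until another department gave me data completely different from what we originally told them to send, is at the end.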

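I am parsing and scrubbing a CSV file so that it can be loaded into a MySQL table. It is loaded by someone else's batch Java program, and if any field is empty the batch job stops with an error.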

I have been told to just put in a blank space wherever a record has an empty field. Would something as simple as this work?

if ( ! length $fields[2] ) { 
    $_ = ' ' for $fields[2];
}

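Is there a way to check several fields at once? Or, better, a way to check all of the fields (this is after the record has been split)? It is the last thing I do before writing the records back to the CSV file.

Here is the entire script. Please don't tell me that what I am doing in the parts of the script that already work is not how you would do it.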

#!/usr/bin/perl

use strict;
use warnings;
use Data::Dumper;
use Time::Piece;

my $filename = 'mistints_1505_comma.csv';
#my $filename = 'test.csv';

# Open input file
open my $FH, $filename
  or die "Could not read from $filename <$!>, program halting.";

# Open error handling file
open ( my $ERR_FH, '>', "errorFiles1505.csv" ) or die $!;

# Read the header line of the input file and print to screen.
chomp(my $line = <$FH>);
my @fields = split(/,/, $line);
print Dumper(@fields), $/;

my @data;

# Read the lines one by one.
while($line = <$FH>) {

    chomp($line);

# Scrub data of characters that cause scripting problems down the line.
    $line =~ s/[\'\\]/ /g;

# split the fields of each record

    my @fields = split(/,/, $line);

# Check if the storeNbr field is empty.  If so, write record to error file.
    if (!length $fields[28]) {
        chomp (@fields);
        my $str = join ',', @fields;
        print $ERR_FH "$str\n";
        }
    else
    {

# Concatenate the first three fields and add to the beginning of each record
    unshift @fields, join '_', @fields[28..30];

# Format the DATE fields for MySQL
    $_ = join '-', (split /\//)[2,0,1] for @fields[10,14,24,26];

# Scrub colons from the data
    $line =~ s/:/ /g;

# If Spectro_Model is "UNKNOWN", change
    if($fields[22] eq "UNKNOWN"){
        $_ = 'UNKNOW' for $fields[22];
        }

# If tran_date is blank, insert 0000-00-00
    if(!length $fields[10]){
        $_ = '0000-00-00' for $fields[10];
        }

# If init_tran_date is blank, insert 0000-00-00
    if(!length $fields[14]){
        $_ = '0000-00-00' for $fields[14];
        }

# If update_tran_date is blank, insert 0000-00-00
    if(!length $fields[24]){
        $_ = '0000-00-00' for $fields[24];
        }

# If cancel_date is blank, insert 0000-00-00
    if(!length $fields[26]){
        $_ = '0000-00-00' for $fields[26];
        }

# Format the PROD_NBR field by deleting any leading zeros before decimals.
    $fields[12] =~ s/^\s*0\././;

# put the records back
    push @data, \@fields;
}
}

close $FH;
close $ERR_FH;

print "Unsorted:\n", Dumper(@data); #, $/;

#Sort the clean files on Primary Key, initTranDate, updateTranDate, and updateTranTime
@data = sort {
    $a->[0] cmp $b->[0] ||
    $a->[14] cmp $b->[14] ||
    $a->[26] cmp $b->[26] ||
    $a->[27] cmp $b-> [27]
} @data;

#open my $OFH, '>', '/swpkg/shared/batch_processing/mistints/parsedMistints.csv';
open my $OFH, '>', '/swpkg/shared/batch_processing/mistints/cleaned1505.csv';
print $OFH join(',', @$_), $/ for @data;
close $OFH;

exit;

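3 answers:

Answer 0 (score: 1)

As I understand it, you have split a record on commas (,) and you want to change every field that is an empty string so that it contains a single space.

I would write this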

use strict;
use warnings 'all';

my $record = 'a,b,c,,e,,g,,i,,k,,m,n,o,p,q,r,s,t';

my @fields = map { $_ eq "" ? ' ' : $_ } split /,/, $record;


use Data::Dump;
dd \@fields;

output

[ "a", "b", "c", " ", "e", " ", "g", " ", "i", " ", "k", " ", "m" .. "t" ]
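One thing worth checking with this approach: by default, split discards trailing empty fields, so a record that ends in one or more commas comes back shorter than expected and those fields never get padded. A minimal sketch, assuming your records can end in empty fields (the sample record below is made up), passes a negative limit so that they are kept:

```perl
use strict;
use warnings;

# Hypothetical record whose last two fields are empty
my $record = 'a,,c,,';

# Without a limit, split drops the trailing empty fields
my @short = split /,/, $record;

# A negative limit keeps every field, including trailing empty ones
my @fields = map { length $_ ? $_ : ' ' } split /,/, $record, -1;

print scalar(@short),  " fields without a limit\n";
print scalar(@fields), " fields with a limit of -1\n";
print join( '|', @fields ), "\n";
```

Here @short ends up with three elements and @fields with five, so only the second form pads the two trailing fields.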

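Alternatively, if some fields need to be set to something different when they are empty, you can set up an array of defaults

It would look like this. Every element of the @defaults array is set to a space, except for fields 10, 11 and 12 (indices 9, 10 and 11), which are set to 0000-00-00. The defaults are picked up after the record has been split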

use strict;
use warnings 'all';

my @defaults = (' ') x 20;

$defaults[$_] = '0000-00-00' for 9, 10, 11;

my $record = 'a,b,c,,e,,g,,i,,k,,m,n,o,p,q,r,s,t';

my @fields = split /,/, $record;

for my $i ( 0 .. $#fields ) {
    $fields[$i] = $defaults[$i] if $fields[$i] eq '';
}


use Data::Dump;
dd \@fields;

output

[ "a", "b", "c", " ", "e", " ", "g", " ", "i", "0000-00-00", "k", "0000-00-00", "m" .. "t" ]
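If only a handful of fields need checking, the same idea can be applied in place. This is a sketch only, and the indices 2, 5 and 7 are picked purely for illustration. A foreach loop aliases $_ to each element of an array slice, so assigning to $_ changes @fields directly.

```perl
use strict;
use warnings;

my @fields = split /,/, 'a,b,,d,e,,g,,i', -1;

# Indices picked for illustration only
for ( @fields[ 2, 5, 7 ] ) {
    # Replace the element if it is undefined or an empty string
    $_ = ' ' unless defined($_) && length($_);
}

print join( '|', @fields ), "\n";
```

The single-element form from the question, $_ = ' ' for $fields[2];, works for the same reason: the loop variable is an alias for the array element.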


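Having seen your full program, I would recommend something like this. If you had shown a sample of your input data, I could have used a hash to refer to the columns by name instead of by number, which would make it much more readable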

#!/usr/bin/perl

use strict;
use warnings 'all';

use Data::Dumper;
use Time::Piece;

my $filename = 'mistints_1505_comma.csv';
#my $filename = 'test.csv';

open my $FH, '<', $filename
        or die "Could not read from $filename <$!>, program halting.";

open( my $ERR_FH, '>', "errorFiles1505.csv" ) or die $!;

chomp( my $line = <$FH> );
my @fields = split /,/, $line;    #/
print Dumper( \@fields ), "\n";

my @data;

# Read the lines one by one.
while ( <$FH> ) {

    chomp;

    # Scrub data of characters that cause scripting problems down the line.
    tr/'\\/  /;                   #'

    my @fields = split /,/;       #/

    # Check if the storeNbr field is empty.  If so, write record to error file.

    if ( $fields[28] eq "" ) {
        my $str = join ',', @fields;
        print $ERR_FH "$str\n";
        next;
    }

    # Concatenate the first three fields and add to the beginning of each record
    unshift @fields, join '_', @fields[ 28 .. 30 ];

    # Format the DATE fields for MySQL
    $_ = join '-', ( split /\// )[ 2, 0, 1 ] for @fields[ 10, 14, 24, 26 ];

    # Scrub colons from the data
    # (note: this acts on $_, the raw line, after the split above, so @fields is not changed)
    tr/://d;                      #/

    my $i = 0;
    for ( @fields ) {

        # If "Spectro_Model" is "UNKNOWN" then change to "UNKNOW"
        if ( $i == 22 ) {
            $_ = 'UNKNOW' if $_ eq 'UNKNOWN';
        }

        # If a date field is blank then insert 0000-00-00
        elsif ( grep { $i == $_ } 10, 14, 24, 26 ) {
            $_ = '0000-00-00' if $_ eq "";
        }

        # Format the PROD_NBR field by deleting any leading zeros before decimals.
        elsif ( $i == 12 ) {
            s/^\s*0\././;
        }

        # Change all remaining empty fields to a single space
        else {
            $_ = ' ' if $_ eq "";
        }

        ++$i;
    }

    push @data, \@fields;
}

close $FH;
close $ERR_FH;

print "Unsorted:\n", Dumper(@data);    #, $/;

#Sort the clean files on Primary Key, initTranDate, updateTranDate, and updateTranTime
@data = sort {
    $a->[0] cmp $b->[0]   or
    $a->[14] cmp $b->[14] or
    $a->[26] cmp $b->[26] or
    $a->[27] cmp $b->[27]
} @data;

#open my $OFH, '>', '/swpkg/shared/batch_processing/mistints/parsedMistints.csv';
open my $OFH, '>', '/swpkg/shared/batch_processing/mistints/cleaned1505.csv' or die $!;
print $OFH join(',', @$_), $/ for @data;
close $OFH;
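The column-name idea mentioned before the script could look roughly like this. It is only a sketch: the header and record below are hypothetical, since the real file's layout was not shown, and the names are borrowed from the comments in the script. The header line is used to build a name-to-index map, and per-column defaults are then looked up by name.

```perl
use strict;
use warnings;

# Hypothetical header and record; the real column layout is not known
my $header = 'storeNbr,tran_date,Spectro_Model,PROD_NBR,cancel_date';
my $record = '101,,UNKNOWN,0.75,';

# Map each column name to its index
my @names = split /,/, $header;
my %col;
@col{@names} = 0 .. $#names;

# Defaults for empty fields, keyed by column name; anything else gets a space
my %default = ( tran_date => '0000-00-00', cancel_date => '0000-00-00' );

my @fields = split /,/, $record, -1;
for my $name (@names) {
    my $i = $col{$name};
    $fields[$i] = $default{$name} // ' '
        if !defined $fields[$i] || $fields[$i] eq '';
}

# Column-specific fixes can now refer to names instead of numbers
$fields[ $col{Spectro_Model} ] = 'UNKNOW'
    if $fields[ $col{Spectro_Model} ] eq 'UNKNOWN';
$fields[ $col{PROD_NBR} ] =~ s/^\s*0\././;

print join( ',', @fields ), "\n";
```

With names in place of numbers, adding or reordering a column in the input only requires the header to be right; the cleaning rules themselves do not need renumbering.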

Answer 1 (score: 0)

Well, if you do this before you split the line into @fields, you should be able to do something like

# assuming a CSV line is in $_
#pad null at start of line
s/^,/ ,/;

#pad nulls in the middle (lookahead, so runs such as ",,," are padded too)
s/,(?=,)/, /g;

#pad null at the end
s/,$/, /;
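As a quick check of this approach, here is a sketch that wraps the three substitutions in a hypothetical helper named pad_empty. It assumes the line has already been chomped and that no field contains a quoted comma. The middle substitution uses a lookahead so that runs of consecutive empty fields are padded in one pass; a plain s/,,/, ,/g skips every other one because its matches cannot overlap.

```perl
use strict;
use warnings;

# pad_empty is a hypothetical helper name, not part of any module
sub pad_empty {
    my ($line) = @_;
    $line =~ s/^,/ ,/;         # empty field at the start of the line
    $line =~ s/,(?=,)/, /g;    # empty fields in the middle, including runs
    $line =~ s/,$/, /;         # empty field at the end of the line
    return $line;
}

my $padded = pad_empty(',a,,,b,');
print "[$padded]\n";
```

Because the padding happens on the raw line, a later split /,/ sees a space in every formerly empty position, including the last one.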

Answer 2 (score: 0)

Don't try to roll your own CSV parsing code. Use Text::CSV or Text::CSV::Slurp.

With Text::CSV, you could do something like
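use Text::CSV;
my $csv = Text::CSV->new( { binary => 1 } );   # parser object; $line holds one CSV record
my $status  = $csv->parse($line);              # parse a CSV string into fields
# by default empty fields come back as "" rather than undef, so test the length as well
my @columns = map { defined $_ && length $_ ? $_ : " " } $csv->fields(); # get the parsed fields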

Are you really sure you want to replace nulls with spaces? I would say that if a field is undefined, it should be NULL in the database.
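If NULLs are what you want, one possible sketch is below. It assumes the cleaned file ends up being read by something that follows MySQL's LOAD DATA convention, where an unquoted \N in a field is read as NULL when the default escape character is in use. Whether the Java batch loader mentioned in the question does the same is not known, so treat that as an assumption to verify.

```perl
use strict;
use warnings;

# Made-up record with two empty fields
my $record = '101,,UNKNOWN,';

# Represent empty fields as undef while cleaning
my @fields = map { length $_ ? $_ : undef } split /,/, $record, -1;

# On output, write \N for undef (assumed NULL marker, see above)
my $out = join ',', map { defined $_ ? $_ : '\N' } @fields;

print "$out\n";
```

Keeping empty fields as undef until the final write also leaves room to treat them differently per column later, for example a date default in one place and NULL in another.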