How do I combine CSV rows based on a duplicate field using Perl Text::CSV?

Time: 2014-07-09 17:06:13

Tags: perl csv

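I want to write a Perl script that:

  1. Periodically monitors a directory for input CSV files
  2. When a file is detected, opens it, reads it, and merges the rows whose second field/column has the same value
  3. Writes the updated CSV file to a new directory, and finally
  4. Deletes the input file.

For example, I have a CSV file containing the following information: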

    "101","5555555555","DOE, JOHN "," DOE, JOHN, your trip
    tomorrow from, 123 Anywhere St Apt #A, to, 100 ELSEWHERE RD APT E, is
    scheduled for pickup between, 1:00 PM, and 1:30 PM"
    
    "102","5555555555","DOE, JOHN "," DOE, JOHN, your trip
    tomorrow from, 100 ELSEWHERE RD APT E, to, 123 Anywhere St Apt #A, is
    scheduled for pickup between, 9:00 PM, and 9:30 PM"
    

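I want the script to read and parse the file, detect the duplicate value in the second field ("5555555555"), and then create a new CSV file that merges the records above into a single record: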

    "101","5555555555","DOE, JOHN "," DOE, JOHN, your trip
    tomorrow from, 123 Anywhere St Apt #A, to, 100 ELSEWHERE RD APT E, is
    scheduled for pickup between, 1:00 PM, and 1:30 PM AND your trip
    tomorrow from, 100 ELSEWHERE RD APT E, to, 123 Anywhere St Apt #A, is
    scheduled for pickup between, 9:00 PM, and 9:30 PM"
    

My current Perl code successfully detects, reads, and parses the file, but I am lost on how to detect the duplicates and combine the rows.

    #!
    use strict;
    use warnings;
    use File::Find;
    use Text::CSV;
    
    $| = 1;
    
    use constant {
        #Check for CSV files only
        SUFFIX_LIST => qr/\.(csv)$/,
        DIR_TO_CHECK => "/Users/Me/Desktop/INBOUND/",
    };
    
    my @file_list;
    
    while (1) {
    
        #Recursively search the input directory for CSV files
        find ( sub {
                return unless -f;
                return unless $_ =~ SUFFIX_LIST;
    
                    #Make sure all of the files in the file list array are unique
                    if(!(grep(/^$_$/, @file_list))) {
                        push @file_list, $File::Find::name;
                    }
               }, DIR_TO_CHECK 
        );
    
    #If .csv files are found...
    if (scalar(@file_list) > 0) {
        print "\nNew Item in Directory\n";
    
        parseFile($file_list[0]);
    
        #Delete input file
        unlink $file_list[0];
    
        print "Deleted File\n";
    
        #Remove the file from the file list
        shift @file_list;
    } else {
    
        print "No New Item\n";
    
    }
    
    sleep 5;
    }
    
    #Subroutine to parse and compare the csv file
    sub parseFile() {
    
    my $csv = Text::CSV->new({ sep_char     => ',',
                           always_quote => 1,
                           quote_char   => '"',
                           escape_char  => '"',
                           binary       => 1,
                           auto_diag    => 1});
    
    #Get the file that was passed to the function
    my $file = $_[0] or die "CSV file not passed in subroutine\n";
    
    #Open file for reading
    open(my $data, '<', $file) or die "Could not open '$file' $!\n";
    
    while (my $line = <$data>) {
    
        print $line;
    
        if ($csv->parse($line)) {
    
            my @fields = $csv->fields();
    
        } else {
    
            #warn "Line could not be parsed: $line\n";
            Text::CSV->error_input();
        }
    }
    
    close $data;
    }
    

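I think I am going about this the wrong way, because I suspect I need to read the file into memory as a whole rather than line by line. Please help, thanks.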

2 Answers:

Answer 0: (score: 0)

I don't write much Perl these days, but here is my answer. Create a hash table that uses the second field as its key, like this:

    $hashtbl{555555} = {
        id    => 102,                         # first field
        names => "doe, john",                 # third field
        msg   => "DOE, JOHN, your trip...",   # last field
    };

If the key already exists in the hash table, append the new row's message to its msg:

    if (exists $hashtbl{$KEY}) {
        $hashtbl{$KEY}{msg} .= " AND $last_field";
    }

After reading the whole file, use this hash table to create the new CSV file.
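
The hash-table approach can be sketched end to end as follows. This is only a minimal sketch under stated assumptions: the sub name `merge_csv` and its filehandle parameters are hypothetical, every record is assumed to have four columns with the message last, and the merged record keeps the first row's ID and name. It reads with `getline`, which returns one complete record even when a quoted field contains line breaks (this needs `binary => 1`), so the multi-line sample records do not have to be slurped by hand.

```perl
use strict;
use warnings;
use Text::CSV;

# Hypothetical helper: merge rows that share the same second column.
# Reads CSV records from $in_fh and writes the merged records to $out_fh.
sub merge_csv {
    my ($in_fh, $out_fh) = @_;

    my $csv = Text::CSV->new({ binary => 1, always_quote => 1, auto_diag => 1 })
        or die "Cannot create Text::CSV object: " . Text::CSV->error_diag();

    # %by_key maps the second field to the merged row;
    # @order remembers the order in which keys were first seen.
    my (%by_key, @order);

    # getline() returns one whole record, even if a quoted field spans lines.
    while (my $row = $csv->getline($in_fh)) {
        next unless @$row >= 4;    # skip blank or short records (assumes 4 columns)

        my $key = $row->[1];
        if (exists $by_key{$key}) {
            # Duplicate key: append this row's message to the stored row.
            $by_key{$key}[3] .= " AND $row->[3]";
        }
        else {
            $by_key{$key} = $row;
            push @order, $key;
        }
    }

    # print() adds no line ending unless eol is set.
    $csv->eol("\n");
    $csv->print($out_fh, $by_key{$_}) for @order;

    return scalar @order;    # number of records written
}
```

Called as `merge_csv($in, $out)` with one handle opened for reading and one for writing, it returns the number of merged records written. The repeated name at the start of each appended message is left in place here.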

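Answer 1: (score: 0)

Something like this should work.

It's not perfect, but it should give you a good head start. For example, you will need to add some code to strip the extra name from the flattened description column.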

    use strict;
    use warnings;
    use Text::CSV;

    # Input and output paths; reading them from @ARGV is an assumption
    my ( $path, $newpath ) = @ARGV;

    my $data = parseFile($path);
    my @flat = map { flatten_record($_) } @$data;
    writeFile( $newpath, \@flat );


    sub csv_cols { return qw/ id phone name desc / }

    # Build a Text::CSV object with the same settings for reading and writing
    sub get_csv {
        return Text::CSV->new({
            sep_char     => ',',
            always_quote => 1,
            quote_char   => '"',
            escape_char  => '"',
            binary       => 1,
            auto_diag    => 1,
        });
    }


    # Subroutine to parse the CSV file and group its rows by phone number
    sub parseFile {
        my ($file) = @_;
        die "CSV file not passed in subroutine\n"
             unless $file;

        my $csv = get_csv();

        # Open file for reading
        open( my $fh, '<', $file )
             or die "Could not open '$file' $!\n";

        $csv->column_names( csv_cols() );

        # Make a hash of arrays containing the rows for each phone number;
        # @order remembers the order in which phone numbers first appear
        my ( %by_phone, @order );
        for my $row ( @{ $csv->getline_hr_all($fh) } ) {
            my $phone = $row->{phone};
            next unless defined $phone;    # skip blank or short rows
            push @order, $phone unless $by_phone{$phone};
            push @{ $by_phone{$phone} }, $row;
        }
        close $fh;

        return [ @by_phone{@order} ];
    }


    # Collapse all rows for one phone number into a single record
    sub flatten_record {
        my ($record) = @_;

        die "Empty record." if @$record == 0;

        return {
            id    => $record->[0]{id},
            phone => $record->[0]{phone},
            name  => $record->[0]{name},
            desc  => join( ' AND ', map { $_->{desc} } @$record ),
        };
    }

    sub writeFile {
        my ( $path, $data ) = @_;

        open my $fh, ">", $path
            or die "Error opening '$path' for writing- $!\n";

        my $csv = get_csv();
        $csv->eol("\n");    # print() adds no line ending unless eol is set

        for my $record (@$data) {
            my @row = @{$record}{ csv_cols() };
            $csv->print( $fh, \@row );
        }

        close $fh
            or die "Error closing '$path': $!\n";
    }
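
This answer notes that the repeated name still has to be stripped from the flattened description. One possible way to do that is sketched below, using only core Perl. The helper name `merge_descs` is hypothetical, and the sketch assumes each description starts with the customer name followed by a comma, as in the sample data; descriptions that do not start that way are left unchanged.

```perl
use strict;
use warnings;

# Hypothetical helper: join several descriptions with " AND ", removing the
# leading "NAME," prefix from every description after the first one.
sub merge_descs {
    my ($name, @descs) = @_;
    return '' unless @descs;

    # Trim the name, because the sample data pads it with a trailing space.
    (my $clean = $name) =~ s/^\s+|\s+$//g;

    my $first = shift @descs;
    for my $desc (@descs) {
        # Drop optional leading whitespace, the literal name, and one comma.
        # @descs holds copies, so the caller's strings are not modified.
        $desc =~ s/^\s*\Q$clean\E\s*,\s*//;
    }

    return join ' AND ', $first, @descs;
}

# Example with shortened versions of the sample descriptions.
print merge_descs(
    'DOE, JOHN ',
    ' DOE, JOHN, your trip tomorrow from A to B',
    ' DOE, JOHN, your trip tomorrow from B to A',
), "\n";
```

In `flatten_record`, the `desc` value could then be built with `merge_descs( $record->[0]{name}, map { $_->{desc} } @$record )` instead of a plain `join`.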