Match rows across multiple CSV files and merge a specific field

Time: 2010-07-29 17:30:24

Tags: perl bash scripting csv

I have about 20 CSVs that look like this:

"[email]","[fname]","[lname]","[prefix]","[suffix]","[fax]","[phone]","[business]","[address1]","[address2]","[city]","[state]","[zip]","[setdate]","[email_type]","[start_code]"

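I have been told that what I need to produce is exactly the same set of files, except that each file should now also contain the start_code values from the other files where the email matches.

It does not matter whether any other fields match; only the email field matters, and the only change to each file is the addition of any other start_code values from the other files where the email matches.

For example, if the same email appears in wicq.csv, oota.csv and itos.csv, the row would go from the first three lines below (one per file) to the final merged line: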

"anon@yahoo.com","anon",,,,,,,,,,,,01/16/08 08:05 PM,,"WIQC PDX"
"anon@yahoo.com","anon",,,,,,,,,,,,01/16/08 08:05 PM,,"OOTA"
"anon@yahoo.com","anon",,,,,,,,,,,,01/16/08 08:05 PM,,"ITOS"

"anon@yahoo.com","anon",,,,,,,,,,,,01/16/08 08:05 PM,,"WIQC PDX, OOTA, ITOS"

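in all three files (wicq.csv, oota.csv and itos.csv).

The tools I have available are the OS X command line (awk, sed, etc.) as well as perl, although I am not very familiar with either, and there may be a better way to do this.

3 Answers:

Answer 0: (score: 1)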

use strict;
use warnings;
use Text::CSV_XS;

# Supply csv files as command line arguments.
my @csv_files = @ARGV;
my $parser    = Text::CSV_XS->new;

# In my test data, the email is the first field. The field
# to be merged is the second. Adjust accordingly.
my $EMAIL_i   = 0;
my $MERGE_i   = 1;

# Process all files, creating a set of key-value pairs:
#    $sc{EMAIL} = [ LIST OF VALUES OBSERVED IN THE MERGE FIELD ]
my %sc;
for my $cf (@csv_files){
    open(my $fh_in, '<', $cf) or die $!;

    while (my $line = <$fh_in>){
        die "Failed parse : $cf : $.\n" unless $parser->parse($line);
        my @fields = $parser->fields;
        push @{ $sc{$fields[$EMAIL_i]} }, $fields[$MERGE_i];
    }
}

# Process the files again, writing new output.
for my $cf (@csv_files){
    open(my $fh_in,  '<', $cf)             or die $!;
    open(my $fh_out, '>', "${cf}_new.csv") or die $!;

    while (my $line = <$fh_in>){
        die "Failed parse : $cf : $.\n" unless $parser->parse($line);
        my @fields = $parser->fields;

        $fields[$MERGE_i] = join ', ', @{ $sc{$fields[$EMAIL_i]} };

        $parser->print($fh_out, \@fields);
        print $fh_out "\n";
    }
}

Answer 1: (score: 0)

I would do it this way:

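# -i.bak (suffix attached) works with both BSD sed on OS X and GNU sed;
# a bare -i is not accepted by BSD sed. Backups are left as *.csv.bak.
cut -d ',' -f1,16 *.csv |
    sort |
    awk -F, '{d=""; if (array[$1]) d=","; array[$1] = array[$1] d $2} END { for (i in array) print i "," array[i]}' |
    while IFS="," read -r email start; do sed -i.bak "/^$email,/ s/,[^,]*\$/,$start/" *.csv; done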

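This builds a list of all emails and start_codes (cut / sort) and merges the codes per email (awk). It then replaces (sed) the start_code on every line with a matching email in every file (while).

But I feel there must be a more efficient way.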
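To see what the merge step does in isolation, here is a minimal sketch. It assumes two-column `"email","start_code"` input with no commas inside either field, which is the shape the cut step yields for simple rows; the function name `merge_codes` and the sample addresses are made up for illustration. Unlike the pipeline above, it strips the quotes and joins the codes into a single quoted field, which is closer to what the question asks for.

```shell
#!/bin/sh
# merge_codes (hypothetical helper): reads "email","code" pairs on stdin and
# prints one line per email with all of its codes joined by ", ".
# Assumption: neither field contains a comma, so plain -F, splitting is enough.
merge_codes() {
    sort |
    awk -F, '
        {
            email = $1; code = $2
            gsub(/"/, "", email)          # drop the surrounding quotes
            gsub(/"/, "", code)
            if (email in codes)
                codes[email] = codes[email] ", " code
            else
                codes[email] = code
        }
        END {
            for (email in codes)
                printf "\"%s\",\"%s\"\n", email, codes[email]
        }
    ' |
    sort                                  # awk array order is unspecified
}

# Demo with made-up rows.
printf '%s\n' \
    '"anon@yahoo.com","WIQC PDX"' \
    '"anon@yahoo.com","OOTA"' \
    '"other@example.com","ITOS"' |
merge_codes
```

Because the input is sorted first, the codes come out in sorted order (OOTA before WIQC PDX), not in file order. Rows whose address fields contain commas would need a real CSV parser, as in the Text::CSV_XS answer.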

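Answer 2: (score: 0)

Here is a simple Perl program that does what you need. It makes a single pass over your input by relying on the fact that the input has been sorted beforehand.

It reads lines and keeps appending the code as long as the email does not change. When the email changes, it prints the record (and fixes up the extra double quotes in the code field).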

#!/usr/bin/perl -l

use strict;
use warnings;

my $last_email = undef;
my @current_record = ();
my @fields = ();

sub print_record {
   # Remove repeated double quotes introduced when we appended the code
  $current_record[15] =~ s/""/, /g;
  print join ",", @current_record;
  @current_record = ();
} 

while (my $input_line = <>) {
  chomp $input_line;
  @fields = split ",", $input_line;

  # Print a record when the email we read changes. Avoid printing on the first
  # loop by checking we have read at least one email ($last_email is defined).
  defined $last_email && ($fields[0] ne $last_email) && print_record;

  if (!@current_record)  {
    # We are starting to process a new email. Grab all fields.
    @current_record = @fields;
  }
  else {
    # We have consecutive records with the same email. Append the code.
    $current_record[15] .= $fields[15];
  }

  # Remember the last processed email. When it changes we will print @current_record.
  $last_email = $fields[0];
}

# Print the last record
print_record

The -l switch makes print automatically append a newline character (whatever the OS is).

Call it like this:

sort *.csv | ./script.pl
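
The single-pass idea (collect codes while consecutive lines share an email, flush when the email changes) can also be sketched in awk. This is only an illustration under simplifying assumptions, not a replacement for the script above: the input is reduced to unquoted `email,code` lines that are already sorted by email, and `group_sorted` is a hypothetical name.

```shell
#!/bin/sh
# group_sorted (hypothetical helper): reads "email,code" lines that are
# already grouped by email and prints one merged line per email.
group_sorted() {
    awk -F, '
        # The email changed: flush the group collected so far.
        NR > 1 && $1 != last { printf "%s,\"%s\"\n", last, codes; codes = "" }
        # Append this code to the current group and remember the email.
        { codes = (codes == "" ? $2 : codes ", " $2); last = $1 }
        # Flush the final group, if any input was read.
        END { if (NR > 0) printf "%s,\"%s\"\n", last, codes }
    '
}

# Demo with made-up rows, already sorted by email.
printf '%s\n' \
    'anon@yahoo.com,WIQC PDX' \
    'anon@yahoo.com,OOTA' \
    'anon@yahoo.com,ITOS' \
    'other@example.com,OOTA' |
group_sorted
```

Only one group is held in memory at a time, which is the benefit of sorting first. Note that this approach, like the Perl script, emits one merged stream rather than rewriting each of the original files.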