Question

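My dataset (a CSV file) contains a large number of people who are undergoing several tests. When a test has been completed, its date is recorded. The second column holds the code of the organization each person belongs to; there are about 40 unique codes/organizations. I am trying to find all the unique codes in the large dataset and then, for each organization (that is, each unique code), produce a file containing only that organization's data. Note that I also need the dates as month-year rather than day-month-year. Here is what my dataset looks like: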

Patient,Code,test1,test2,test3 
P1,072,,25-Mar-14,                                          
P2,072,29-May-14,,                                           
P3,073,,03-Jan-14,                                  
P4,074,,,16-Feb-14                                           
P5,075,,09-Jul-14,                                          
P6,075,08-Jun-14,,

The output should look like this: a file for 072, including the header, as follows:

Patient,Code,test1,test2,test3 
P1,072,,25-Mar-14,                                          
P2,072,29-May-14,,

Another file, for 073, with the header, would look similar:

Patient,Code,test1,test2,test3 
P3,073,,03-Jan-14,

and so on.

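Here is the code I wrote. It saves the unique codes and creates a CSV file named after each organization's code, but it does not fill each file with the appropriate information (only the data for that particular organization, with mm-yy in place of dd-mm-yy). Can anyone tell me what is wrong with the code?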

use feature ':5.12';
use strict;
use warnings;
use autodie;

my $dataset          = 'R:/dataset/';
my $output_directory = 'R:/results/';

open my $infh, '<', "$dataset/CH_dataset.csv";

my %codes = ();
while (<$infh>) {
    chomp;
    my @columns = split ",";
    print "$columns[1]\n" if !$codes{ $columns[1] }++;
    my @unique_codes = keys %scodes;

    foreach my $unique_codes (@unique_codes) {
        open my $outfh, ">>", "$output_directory/CH_$unique_codes\_v$version.$update.csv";
        print $outfh $_
            if (/"$unique_codes"/
            and s/\d\d\-Jan\-/Jan\-/g | s/\d\d\-Feb\-/Feb\-/g | s/\d\d\-Mar\-/Mar\-/g | s/\d\d\-Apr\-/Apr\-/g
            | s/\d\d\-May\-/May\-/g | s/\d\d\-Jun\-/Jun\-/g | s/\d\d\-Jul\-/Jul\-/g | s/\d\d\-Aug\-/Aug\-/g
            | s/\d\d\-Sep\-/Sep\-/g | s/\d\d\-Oct\-/Oct\-/g | s/\d\d\-Nov\-/Nov\-/g | s/\d\d\-Dec\-/Dec\-/g );
    }
}

Thanks for your help!

Answer 1

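I once had a similar task. I used a hash to hold all the required filehandles and closed them all before exiting. The following code will work if your data follows a strict format.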

use 5.14.0;
use warnings;
use Carp;

my $infile = $ARGV[0];

my %fh;    # this hash will have your codes as keys and the
           # corresponding filehandles as values.

# {{{ Open the infile and work
open( INFILE, '<', $infile ) or croak("Could not open $infile: $!");
my $header = readline(INFILE);    # read the header; it is written to every output file
chomp($header);
my $justOpened = 0;
while ( my $line = readline(INFILE) ) {
    chomp($line);
    if ( $line =~ m/^\s*\#/ or $line =~ m/^\s*$/ ) { next; }
    my @ll = split( /,/, $line );
    my $code = $ll[1];
    # Drop the day of month from every dd-Mon-yy date on the line.
    # Rows without any date are left unchanged.
    $line =~ s/\b\d{2}-(\w{3}-\d{2})\b/$1/g;
    unless ( exists( $fh{$code} ) ) {
        my $fn = "code" . $code . '.csv';
        open( $fh{$code}, ">", $fn ) or croak("Could not open $fn: $!");
        $justOpened = 1;
    }
    select( $fh{$code} );
    if ($justOpened) {
        print("$header\n");
        $justOpened = 0;
    }
    print("$line\n");
}
close(INFILE);
# }}}

# {{{ close all the filehandles before exiting.
for my $handle ( values(%fh) ) {
    close($handle);
}
# }}}

exit;
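
To try the date handling on its own, here is a minimal sketch. The helper name strip_day is my own and not part of the code above; it assumes dates always look like dd-Mon-yy and removes the day from every such date on a line, so a row with several completed tests is handled too.

```perl
use strict;
use warnings;

# Hypothetical helper (name chosen for illustration): drop the day of month
# from every dd-Mon-yy date in a CSV line, leaving Mon-yy.
sub strip_day {
    my ($line) = @_;
    $line =~ s/\b\d{2}-([A-Za-z]{3}-\d{2})\b/$1/g;
    return $line;
}

# A row with two dates; prints: P7,076,Jan-14,Feb-14,
print strip_day('P7,076,01-Jan-14,15-Feb-14,'), "\n";
```

The three-digit organization code is left alone because the pattern requires a hyphen right after the two digits.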

Answer 2

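Your description of the goal is clear. However, your code looks quite wrong.

Rather than trying to work out where the program goes wrong, I'll show how I would solve the problem: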

use feature ':5.12';
use strict;
use warnings;
use autodie;

my $dataset          = 'R:/dataset/';
my $output_directory = 'R:/results/';

#open my $infh, '<', "$dataset/CH_dataset.csv";
my $infh = \*DATA;

my $header = <$infh>;

my %codes = ();
while (<$infh>) {
    chomp;
    my $code = ( split ',' )[1];

    #my $outfile = "$output_directory/CH_${code}_v$version.$update.csv";
    my $outfile = "CH_${code}.csv";

    my $outfh;
    if ( !-e $outfile ) {
        open $outfh, '>', $outfile;
        print $outfh $header;
    } else {
        open $outfh, '>>', $outfile;
    }

    # Remove Day of Month
    s/\d{2}-(?=(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\d{2})//g;

    print $outfh "$_\n";
}

__DATA__
Patient,Code,test1,test2,test3 
P1,072,,25-Mar-14,                                          
P2,072,29-May-14,,                                           
P3,073,,03-Jan-14,                                  
P4,074,,,16-Feb-14                                           
P5,075,,09-Jul-14,                                          
P6,075,08-Jun-14,,

This outputs 4 files:

$ ls CH_07*
CH_072.csv  CH_073.csv  CH_074.csv  CH_075.csv

$ cat CH_07*
Patient,Code,test1,test2,test3 
P1,072,,Mar-14,                                          
P2,072,May-14,,                                           
Patient,Code,test1,test2,test3 
P3,073,,Jan-14,                                  
Patient,Code,test1,test2,test3 
P4,074,,,Feb-14                                           
Patient,Code,test1,test2,test3 
P5,075,,Jul-14,                                          
P6,075,Jun-14,,
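
One thing to note about the -e test above: if CH_*.csv files are left over from an earlier run, new rows are appended to them without a fresh header. A possible variant, sketched below under my own assumptions, keeps one open filehandle per code in a hash, so each file is truncated and opened exactly once per run. The sub name split_by_code and the temporary output directory are only for illustration.

```perl
use strict;
use warnings;
use autodie;
use File::Spec;
use File::Temp qw(tempdir);

# Sketch: split CSV rows by the code in column 2, one output file per code.
# split_by_code() and the CH_<code>.csv naming are illustrative assumptions.
sub split_by_code {
    my ( $in_fh, $out_dir ) = @_;
    my $header = <$in_fh>;
    my %out_fh;    # code => open output filehandle

    while ( my $line = <$in_fh> ) {
        next unless $line =~ /\S/;    # skip blank lines
        my $code = ( split /,/, $line )[1];
        next unless defined $code;    # skip rows with no second column

        # Open each file once, truncating anything left from an earlier run.
        if ( !$out_fh{$code} ) {
            my $path = File::Spec->catfile( $out_dir, "CH_$code.csv" );
            open my $fh, '>', $path;
            print {$fh} $header;
            $out_fh{$code} = $fh;
        }

        # Remove the day of month: dd-Mon-yy becomes Mon-yy.
        $line =~ s/\b\d{2}-(?=[A-Za-z]{3}-\d{2}\b)//g;
        print { $out_fh{$code} } $line;
    }

    close $_ for values %out_fh;
    my @codes = sort keys %out_fh;
    return @codes;
}

# Usage with an in-memory sample and a throwaway directory.
my $csv = "Patient,Code,test1,test2,test3\nP1,072,,25-Mar-14,\nP3,073,,03-Jan-14,\n";
open my $in, '<', \$csv;
my $dir = tempdir( CLEANUP => 1 );
print "wrote CH_$_.csv\n" for split_by_code( $in, $dir );
```

With about 40 organizations, keeping that many handles open at once should be well within normal limits, and it avoids reopening a file for every input row.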
