Putting many large text files together... my first row is misaligned for some reason

Time: 2013-12-03 23:20:00

Tags: arrays regex perl reference

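Final edit: thanks to some of the input I got here, I solved my problem! Project done! The code below works. It is fairly fast on roughly 100 txt files (about 2000 rows and 10 columns each), but any suggestions for improving the speed would be welcome!

I have several folders, each containing one year of data. Each folder holds many txt files.

The top line holds the header names. The second line holds the units. Two headers can have the same name and still be distinct because their units differ. In some years we collected a certain kind of data (e.g. turbidity), but in other years we did not.

Example data: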

EX1

DateTime    Temp    SpCond  Salinity    DO% DO Conc DO Charge   Depth   pH  pHmV    Chlorophyll %Fluorescence
M/D/Y   C   mS/cm   ppt %   mg/L        m       mV  ug/L    %FS
2/17/2009 14:01 2.79    45.303  28.45   124.4   13.87   46  1.092   8.56    -93.4   4.7 1.1

EX2

Date/Time   Temp    Speci Cond  Salinity    DO  DO  DO Charge   Depth   pH  pH
mm/dd/yyyy hh:mm:ss °C  mS/cm   PPT %   mg/L    DOchrg  meters  pH  mV
7/13/2010 13:31 23.52   46.821  30.44   72.8    5.19    39  9.369   7.69    -46.3

Output:

Date/Timemm/dd/yyyy hh:mm:ss    Temperature°C  Specific CondmS/cm  SalinityPPT DO% DOmg/L  DO ChargeDOchrg Depthmeters pHpH    pHmV    Chlorophyllug/L ChlorophyllRFU  Temperature°C   ConductivitymS/cm   ResistivityKOhm.cm  TDSg/L  Densityg/cm3    

1/15/2010 13:30 2.41    49.78   31.49   129.7   14.31   98  1.108   8.08    -85.6   7.7 1.8 -9999   -9999   -9999   -9999   -9999
1/15/2010 13:45 2.26    49.708  31.42   126.7   14.03   98  1.104   8.08    -85.7   9.1 2.2 -9999   -9999   -9999   -9999   -9999
1/15/2010 14:00 2.23    49.664  31.38   126.3   14  99  1.092   8.1 -86.5   8.5 2   -9999   -9999   -9999   -9999   -9999
1/15/2010 14:15 2.19    49.685  31.39   125.1   13.88   97  1.091   8.11    -87 8.3 2   -9999   -9999   -9999   -9999   -9999
1/15/2010 14:30 2.22    49.703  31.41   125.3   13.89   99  1.105   8.11    -87.5   8.4 2   -9999   -9999   -9999   -9999   -9999

Code

#!/usr/bin/perl



#procedure
#1-find all unique headers in each file
#2-put the data from each file into a new one that is defined by the unique headers

use Tie::File;  #each txt file is represented as an array

my @tog=(); #where I will store the headers and units I find
my @lines=();
{
     opendir my $CWD, '.' or die "opendir .: $!\n";

    my @files = grep /\.txt$/i, readdir $CWD; #read the txt file
    closedir $CWD;
    for (@files) {
        tie my @lines, 'Tie::File', $_ or die $!;
            my @headers = split(/\t/,$lines[0]); 
            my @units=split(/\t/,$lines[1]);        
            for( my $i=0 ; $i<=$#headers; $i++){
            my $one= join "",$headers[$i],$units[$i];
            chomp($one);
            push(@tog,$one);
            }
            }


}

#1-get the unique headers
my %seen;
@tog = grep { ! $seen{$_}++ } @tog; #get the unique headers of all the files in the folder
@tog = grep {$_} @tog;

my $UH=@tog;
my @headers=();
#create a new file's headers name based on unique header name
for( my $f=0; $f<$UH; $f++){ #use < here: with <= the loop reads one element past the end and adds a blank header
    print "$tog[$f] \t $f\n"; #list each unique header with its index
    push(@headers, $tog[$f]); # create header based on unique variable
}


open my $fh, '>', 'DATAEXPORT.txt' or die "Could not open file: $!"; #declare your file handle fh. this will do the writing
print $fh join("\t", @headers), "\n";
close $fh; #flush the header line now, before the data rows are appended below

#2-put the data from each file into a new one that is defined by the unique headers
{    
    opendir my $CWD, '.' or die "opendir .: $!\n";
    my @files = grep /\.txt$/i, readdir $CWD; #read the txt file
    closedir $CWD;      
    for (@files) {
    my @search=();  
        tie my @lines, 'Tie::File', $_ or die $!;
            my @headers = split(/\t/,$lines[0]); 
            my @units=split(/\t/,$lines[1]);    
            for( my $i=0 ; $i<=$#units; $i++){
            my $one= join "",$headers[$i],$units[$i];
            chomp($one);
            push(@search,$one);
            }       
        my @expr=@tog;
        @pattern= grep {$_} @search;
#this is the array that contains the headers and units of the particular file I am looking at       
            #Now that I have read what matches the expression, I should use these things to write into a txt file

    my $Nlines=$#lines; #grab the number of lines you will be working with  
        for( my $j=0; $j<=$Nlines; $j++){
            $rownum=$j;     
                my @dataline_array=split(/\t/,$lines[$j]);
                my @datarow=();
                for(my $i=0; $i<=$#expr; $i++){                             
                    $found=0;
                    for(my $ii=0; $ii<=$#pattern; $ii++){ #Do this until you are cycling through all data points                                                                                    
                        if ($pattern[$ii] eq $expr[$i]){ #exact comparison: a regex match would also hit substrings (e.g. "pH" inside "pHmV")
                            $found=1;
                            chomp($dataline_array[$ii]);
                            push(@datarow,$dataline_array[$ii]);
                        }
                    }                   
                    if($found eq 0){ #if we looked through all of them, and didn't find a match
                    push(@datarow,'-9999');
                    }
                    undef $found;                   
                    #loop through each expression                
                }
                #do this for every row you write                
                open(my $fh, '>>', 'DATAEXPORT.txt') or die "Could not open file 'DATAEXPORT.txt' $!"; #open and append to bottom of file
                print $fh join("\t",@datarow), "\n";
                close $fh;

                undef @datarow;

        }
        #Now that we have gone through all of our lines, lets print out doc 
        undef @expr;
        undef @pattern;     
        }
}   

1 Answer:

Answer 0 (score: 0)

If you are getting blank fields when splitting the file headers, then instead of

my @headers = grep { /\S/ } split(/\t/,$lines[0]);
my @units=grep { /\S/ } split(/\t/,$lines[1]);

try this:

my @headers = split /\s*\t\s*/, $lines[0];
my @units = split /\s*\t\s*/, $lines[1];

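My guess is that some of the field separators have spaces mixed in with the tabs, which makes it hard to parse the lines cleanly into @headers and @units. The regex \s*\t\s* means "a run of whitespace characters containing at least one tab", so it should automatically strip any stray whitespace for you.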
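
As a quick check of what that pattern does, here is a minimal sketch on made-up sample lines (the data and the helper names split_loose and split_strict are illustrative and do not come from the question). It also shows one caveat: because \s matches a tab as well, two tabs in a row (an empty field, such as a column with no unit) are consumed as a single separator. If empty fields have to keep their position, trimming only the spaces around each tab is one possible alternative.

```perl
use strict;
use warnings;

# Hypothetical helpers, for illustration only.
# Loose: any run of whitespace that contains a tab counts as one separator.
sub split_loose  { my ($line) = @_; return split /\s*\t\s*/, $line; }
# Strict: exactly one tab per separator, with stray spaces around it trimmed.
sub split_strict { my ($line) = @_; return split / *\t */, $line; }

# Made-up sample: spaces mixed in around the tabs.
my $messy = "Temp \tSpCond\t Salinity";
print join('|', split_loose($messy)), "\n";   # Temp|SpCond|Salinity

# Made-up sample: an empty field between mg/L and m (two tabs in a row).
my $gap = "mg/L\t\tm";
print join('|', split_loose($gap)), "\n";     # mg/L|m   (empty field lost)
print join('|', split_strict($gap)), "\n";    # mg/L||m  (empty field kept)
```

Which variant fits depends on whether the real files contain empty unit fields; that is worth checking on one file before picking a pattern.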

一些一般性的建议和想法:

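  • Using Tie::File means that if you modify one of the elements of @lines, the change is written back to the file. I don't think that is what you want. If not, it is probably safer to do this: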

    open(my $in, '<', $_) or die "$_: $!";  # read-only, three-argument open
    @lines = <$in>;
    close $in;
    
  • It doesn't look like there is any difference between the @headers array and @tog. You can remove the code that copies @tog into @headers.
  • The code that combines @headers and @units for each file could be moved into a subroutine, which would make your code easier to follow:

    sub combined_headers_and_units {
        my ($header_line, $units_line) = @_;
        chomp $header_line;
        chomp $units_line;
        my @headers = split /\s*\t\s*/, $header_line;
        my @units = split /\s*\t\s*/, $units_line;
        my @combined = ();
        for( my $i=0 ; $i <= $#headers; $i++) {
            my $one= join ",",$headers[$i],$units[$i];
            push @combined, $one;
        }
        return @combined;
    }
    

    Then:

    for (@files) {
        open(my $in, '<', $_) or die "$_: $!";  # read-only, three-argument open
        my $header_line = <$in>;
        my $units_line = <$in>;
        close $in;
        push @tog, combined_headers_and_units($header_line, $units_line);
    }
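
Since the question asks about speed: one common approach is to replace the nested matching loop with a hash lookup per column and to keep the output file open instead of reopening it for every row. The following is only a sketch under stated assumptions. merge_rows is a hypothetical helper, the column keys are header and unit joined with no separator as in the question's code, and -9999 is the fill value seen in the sample output.

```perl
use strict;
use warnings;

# Hypothetical helper: reorder one file's data rows into the unified layout.
# $all_cols  : unique header+unit keys across all files (output column order)
# $file_cols : the same kind of keys for the current file, in file order
# $rows      : the file's data lines (tab-separated)
sub merge_rows {
    my ($all_cols, $file_cols, $rows) = @_;

    # Build the key -> column index map once per file.
    my %index_of;
    for my $i (0 .. $#$file_cols) {
        my $key = $file_cols->[$i];
        $index_of{$key} = $i unless exists $index_of{$key};
    }

    my @out;
    for my $row (@$rows) {
        (my $line = $row) =~ s/\r?\n\z//;      # strip the line ending on a copy
        my @fields = split /\t/, $line, -1;    # -1 keeps trailing empty fields
        my @cells = map {
            my $i = $index_of{$_};
            (defined($i) && defined($fields[$i])) ? $fields[$i] : '-9999';
        } @$all_cols;
        push @out, join("\t", @cells);
    }
    return @out;
}

# Made-up example: the file has Temp and pH; the unified layout adds Depth.
my @merged = merge_rows(
    [ 'TempC', 'Depthm', 'pHpH' ],
    [ 'TempC', 'pHpH' ],
    [ "2.79\t8.56\n", "2.41\t8.08\n" ],
);
print "$_\n" for @merged;
```

In a full script the output handle would be opened once before the loop over the files, each file's merged rows printed to it, and the handle closed once at the end.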
    

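I'd be happy to offer more suggestions, but it is hard to tell what you are trying to do with your data. If you can describe more specifically what your goal is and what you want the output to look like, we can try to give you more specific advice.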