Final edit: thanks to the input I got here, I solved my problem. Project complete! The code below works. It is reasonably fast on ~100 txt files (each roughly 2000 rows by 10 columns), but any suggestions for making it faster would be cool!
I have several folders, each holding one year of data, and each folder contains many txt files.
The top row of each file holds the header names and the second row holds the units. Two headers can share the same name yet still be different columns because their units differ. In some years we collected a particular kind of data (turbidity, for example), and in other years we did not.
Example data:
EX1
DateTime Temp SpCond Salinity DO% DO Conc DO Charge Depth pH pHmV Chlorophyll %Fluorescence
M/D/Y C mS/cm ppt % mg/L m mV ug/L %FS
2/17/2009 14:01 2.79 45.303 28.45 124.4 13.87 46 1.092 8.56 -93.4 4.7 1.1
EX2
Date/Time Temp Speci Cond Salinity DO DO DO Charge Depth pH pH
mm/dd/yyyy hh:mm:ss °C mS/cm PPT % mg/L DOchrg meters pH mV
7/13/2010 13:31 23.52 46.821 30.44 72.8 5.19 39 9.369 7.69 -46.3
Output:
Date/Timemm/dd/yyyy hh:mm:ss Temperature°C Specific CondmS/cm SalinityPPT DO% DOmg/L DO ChargeDOchrg Depthmeters pHpH pHmV Chlorophyllug/L ChlorophyllRFU Temperature°C ConductivitymS/cm ResistivityKOhm.cm TDSg/L Densityg/cm3
1/15/2010 13:30 2.41 49.78 31.49 129.7 14.31 98 1.108 8.08 -85.6 7.7 1.8 -9999 -9999 -9999 -9999 -9999
1/15/2010 13:45 2.26 49.708 31.42 126.7 14.03 98 1.104 8.08 -85.7 9.1 2.2 -9999 -9999 -9999 -9999 -9999
1/15/2010 14:00 2.23 49.664 31.38 126.3 14 99 1.092 8.1 -86.5 8.5 2 -9999 -9999 -9999 -9999 -9999
1/15/2010 14:15 2.19 49.685 31.39 125.1 13.88 97 1.091 8.11 -87 8.3 2 -9999 -9999 -9999 -9999 -9999
1/15/2010 14:30 2.22 49.703 31.41 125.3 13.89 99 1.105 8.11 -87.5 8.4 2 -9999 -9999 -9999 -9999 -9999
Code
#!/usr/bin/perl
# Procedure:
# 1 - find every unique header/unit combination across all files
# 2 - write the data from each file into a new file whose columns are those unique headers
use strict;
use warnings;
use Tie::File;   # each txt file is represented as an array of lines

my @tog   = ();  # where the header+unit strings I find are stored
my @lines = ();
{
    opendir my $CWD, '.' or die "opendir .: $!\n";
    my @files = grep /\.txt$/i, readdir $CWD;   # collect the txt file names
    closedir $CWD;
    for (@files) {
        tie my @lines, 'Tie::File', $_ or die $!;
        my @headers = split( /\t/, $lines[0] );
        my @units   = split( /\t/, $lines[1] );
        for ( my $i = 0; $i <= $#headers; $i++ ) {
            my $one = join "", $headers[$i], ( $units[$i] // '' );   # some columns have no unit
            chomp($one);
            push( @tog, $one );
        }
    }
}
# 1 - get the unique headers
my %seen;
@tog = grep { !$seen{$_}++ } @tog;   # keep only the first occurrence of each header+unit string
@tog = grep { $_ } @tog;             # drop any blank entries
my $UH      = @tog;                  # number of unique headers
my @headers = ();
# build the new file's header line from the unique header names
for ( my $f = 0; $f < $UH; $f++ ) {  # note: < rather than <=, so no undefined entry slips in at the end
    print "$tog[$f] \t $f\n";        # show each unique header and the column it will occupy
    push( @headers, $tog[$f] );
}
open my $fh, '>', 'DATAEXPORT.txt' or die "Could not open file: $!";   # $fh is the output handle; all writing goes through it
print $fh join( "\t", @headers ), "\n";
# 2 - put the data from each file into the new file, aligned to the unique headers
{
    opendir my $CWD, '.' or die "opendir .: $!\n";
    my @files = grep /\.txt$/i, readdir $CWD;   # collect the txt file names again
    closedir $CWD;
    for (@files) {
        my @search = ();
        tie my @lines, 'Tie::File', $_ or die $!;
        my @headers = split( /\t/, $lines[0] );
        my @units   = split( /\t/, $lines[1] );
        for ( my $i = 0; $i <= $#units; $i++ ) {
            my $one = join "", $headers[$i], ( $units[$i] // '' );
            chomp($one);
            push( @search, $one );
        }
        my @expr    = @tog;
        my @pattern = grep { $_ } @search;   # the header+unit strings of the file currently being read
        # Match each unique header against this file's headers, then write every data row.
        my $Nlines = $#lines;                # index of the last line in this file
        for ( my $j = 2; $j <= $Nlines; $j++ ) {   # rows 0 and 1 are the header and unit lines, so the data starts at row 2
            my @dataline_array = split( /\t/, $lines[$j] );
            my @datarow        = ();
            for ( my $i = 0; $i <= $#expr; $i++ ) {            # loop through each unique header
                my $found = 0;
                for ( my $ii = 0; $ii <= $#pattern; $ii++ ) {  # look for it among this file's headers
                    if ( $pattern[$ii] =~ m/\Q$expr[$i]\E/ ) { # \Q...\E so the header string is matched literally, not as a regex
                        $found = 1;
                        my $value = $dataline_array[$ii] // '';
                        chomp $value;
                        push( @datarow, $value );
                        last;                                  # stop after the first matching column
                    }
                }
                if ( $found == 0 ) {                           # this file has no matching column
                    push( @datarow, '-9999' );
                }
            }
            # write the row through the handle opened above; reopening the output file for every row is much slower
            print $fh join( "\t", @datarow ), "\n";
        }
    }
}
close $fh;
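Since the opening note asks about speed: the nested matching loop re-discovers the column layout for every single row, even though the layout only changes from file to file. Below is a rough sketch of one way to precompute the column mapping once per file and reuse it for every row. The names %col_of and @map are illustrative, not from the original code, and the sketch assumes the header+unit strings compare equal exactly (rather than via the substring match used above), which holds when @expr and @pattern are built the same way.

# Build the output-column -> input-column mapping once per file.
my %col_of;                                   # header+unit string => column index in this file
$col_of{ $pattern[$_] } = $_ for 0 .. $#pattern;
my @map = map { exists $col_of{$_} ? $col_of{$_} : -1 } @expr;

# Reuse the mapping for every data row; -1 means "this file lacks the column".
for my $j ( 2 .. $#lines ) {
    my @cells = split /\t/, $lines[$j];
    chomp @cells;
    my @datarow = map { $_ >= 0 && defined $cells[$_] ? $cells[$_] : '-9999' } @map;
    print $fh join( "\t", @datarow ), "\n";
}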
Answer 0 (score: 0)
If you are getting blank fields when you split the file headers, then instead of
my @headers = grep { /\S/ } split(/\t/,$lines[0]);
my @units = grep { /\S/ } split(/\t/,$lines[1]);
try this:
my @headers = split /\s*\t\s*/, $lines[0];
my @units = split /\s*\t\s*/, $lines[1];
My guess is that some of the field separators have spaces mixed in with the tabs, which makes it hard to parse them cleanly into @headers and @units. The regular expression \s*\t\s* means "a run of whitespace containing at least one tab", so it should automatically strip out any stray spaces for you.
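As a quick illustration (the header line here is made up, but the spacing mirrors the kind of stray spaces described above):

use strict;
use warnings;

my $line = "DateTime \t Temp\t SpCond \tSalinity";   # stray spaces around the tabs

my @plain   = split /\t/,       $line;   # fields keep the spaces: "DateTime ", " Temp", " SpCond ", "Salinity"
my @trimmed = split /\s*\t\s*/, $line;   # fields come out clean:  "DateTime", "Temp", "SpCond", "Salinity"

print join( "|", @plain ),   "\n";   # DateTime | Temp| SpCond |Salinity
print join( "|", @trimmed ), "\n";   # DateTime|Temp|SpCond|Salinity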
Some general suggestions and ideas:
Using Tie::File means that if you modify one of the elements of @lines, that change is written back into the file itself. I don't think that is what you are trying to do. If not, it is probably safer to do this instead:
open(FILE, $_) or die $!;
@lines = <FILE>;
close FILE;
It does not look like the @headers array ends up any different from @tog, so you could drop the loop that copies @tog into @headers.
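Concretely, that whole indexed loop could shrink to a plain array copy, something like this (keeping a diagnostic print in case you still want to see the columns listed):

my @headers = @tog;                        # the unique header+unit strings, in order
print "$_\n" for @headers;                 # optional: list the columns that will be written
print $fh join( "\t", @headers ), "\n";    # same header line as before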
The code that combines @headers and @units for each file could also be moved into a subroutine, which would make your code easier to follow:
sub combined_headers_and_units {
    my ($header_line, $units_line) = @_;
    chomp $header_line;
    chomp $units_line;
    my @headers  = split /\s*\t\s*/, $header_line;
    my @units    = split /\s*\t\s*/, $units_line;
    my @combined = ();
    for ( my $i = 0; $i <= $#headers; $i++ ) {
        my $one = join ",", $headers[$i], $units[$i];
        push @combined, $one;
    }
    return @combined;
}
Then:
for (@files) {
    open (FILE, $_) or die $!;
    my $header_line = <FILE>;
    my $units_line  = <FILE>;
    close FILE;
    push @tog, combined_headers_and_units($header_line, $units_line);
}
I'd be happy to offer other suggestions, but it is hard to tell exactly what you are trying to do with the data you have. If you can describe more specifically what your goal is and what you want the output to look like, we can try to give you more concrete advice for solving the problem.