Extracting the required columns from a dataset with Perl

Time: 2013-10-23 02:42:51

Tags: perl dataset

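I have two files. file1 (sample.txt) is a list of sample IDs (about 1000). These sample IDs are column names in file2 (sampleValue.txt), which is a 30000 x 1500 data matrix. I am interested in the values of all rows for 1000 of the 1500 columns, e.g. columns 1, 2, 5, 6, 70, 71, 75, 100, 112, 114 and so on. There is no pattern to the column positions. Here is what I am doing, and I would like to know how to improve it. This is my code: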

## Opening first file
my %sampleID;
open my $IN, '<', "sample.txt" or die $!;
my $header = <$IN>;

while (<$IN>) {
    chomp $_;
    my @line = split(/\t/, $_);
    $sampleID{$line[0]} = 1; ## Sample ID
}
close($IN);
print "Total number of sample ID: ", scalar(keys %sampleID), "\n"; ## 1000 columns

## Sample Value Data
my %sampleValue;
open $IN, '<', "sampleValue.txt" or die $!;

## Columns are sample names from file1
$header = <$IN>;
chomp $header; ## otherwise the last sample name keeps its trailing newline
my @samples = split(/\t/, $header);
print "Total samples: ", scalar(@samples), "\n"; ## 1500

## loop over all the sample IDs, i.e. the columns I am interested in
for (my $i = 1; $i <= $#samples; $i++) { ## start at 1 because element 0 is the header of column 1
    my $sample = $samples[$i];
    $sampleValue{$sample} = $i if (exists $sampleID{$sample});
}

my $col = "";
foreach my $key (keys %sampleValue) {
    $col = $sampleValue{$key} . "," . $col;
}
chop($col);
print $col, "\n"; ## string of all the columns I am interested in

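I build the column list in the loop above because I do not want to do a hash lookup for every column of interest while reading the file row by row.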

## Reading the sample Value file row by row
while (<$IN>) {
    chomp $_;
    print $_, "\n";
    my @line = split(/\t/, $_);
    @line = @line[$col]; ## error since it is string type
    print @line, "\n";
}

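I get an error on the line @line = @line[$col]; because $col is a single string rather than a list of numbers. However, writing @line[1,2,5,6,70,71,75,100,112,114] works. So my question is: is there a simple way to convert the string $col into a comma-separated list of numbers, or is there a better way to get the required columns?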
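One possible direction, as a minimal sketch rather than a tested solution for the real files: an array slice needs a list of indices, so the string can be split back into a list, or the indices can be collected in an array from the start. A string such as "2,4" used as a subscript is evaluated as a single number (its leading numeric part), which is why @line[$col] does not return all the columns. The tiny in-memory data set below and the names @rows, @from_string and @cols are hypothetical and only stand in for sample.txt and sampleValue.txt.

```perl
use strict;
use warnings;

# Hypothetical miniature data standing in for sample.txt / sampleValue.txt.
my %sampleID = map { $_ => 1 } qw(S2 S4);   # sample IDs of interest
my @rows = (
    "gene\tS1\tS2\tS3\tS4\n",               # header line
    "g1\t10\t20\t30\t40\n",
    "g2\t11\t21\t31\t41\n",
);

# Option 1: turn the comma-separated string back into a list of indices.
my $col = "2,4";
my @from_string = split /,/, $col;

# Option 2: skip the string and collect the wanted indices in an array.
my $header = shift @rows;
chomp $header;                              # else the last name keeps its "\n"
my @samples = split /\t/, $header;
my @cols = grep { exists $sampleID{ $samples[$_] } } 1 .. $#samples;

# Slice every row with the precomputed index list; no hash lookup per row.
for my $row (@rows) {
    chomp $row;
    my @line = split /\t/, $row;
    print join("\t", @line[@cols]), "\n";
}
```

Building @cols with grep over 1 .. $#samples also keeps the columns in file order, whereas iterating over keys %sampleValue returns them in no particular order.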

0 answers:

No answers yet