I have two txt files, each with multiple columns. This is what the first file ($frequency) looks like:
C1 C2 A a B b C c D d
text 1 0 1 0 0 0 0 0 0
text 2 1 0 5 4 0 0 0 0
text 3 0 0 0 0 10 11 3 6
text 4 1 0 9 4 0 2 0 0
text 5 5 3 0 0 6 7 4 0
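So C2 contains every position from 1 to 20000. Columns A through d contain numeric values that are all greater than or equal to 0.
This is what the second file ($variants) looks like: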
C1 C2 C3 C4
text 2 A D
text 4 B C
text 5 A B,D
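Here C2 contains some of the values between 1 and 20000. C3 and C4 contain letters from A to D (matching the column names in the first table, but always uppercase). What I want to do is: match the C2 value in $variants against the C2 value in $frequency, check which letter is in C3 of $variants, and then copy the corresponding values from $frequency into two new columns in $variants (so the correct row, and the correct uppercase and lowercase letter columns). The same then needs to be done for C4 of $variants.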
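Edit: sometimes C4 in $variants contains two letters separated by ','. In that case the values from $frequency for both letters should appear in the output.
Based on this example, this is what the output should look like:
C1 C2 C3 C4 C5 C6 C7 C8 C9 C10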
text 2 A D 1 0 0 0 empty
text 4 B C 9 4 0 2 empty
text 5 A B,D 5 3 0 0 4 0
I have started on the script, but I am stuck at the point where the values and letters need to be compared.
This is what I have so far:
my $table1 = prompt("Give the name of the file with variants:\n");
open(my $variants, '<',$table1) || die "Could not open file $table1 $!";
my $table2 = prompt("Give the name of the file with the frequencies: \n");
open(my $frequency, '<',$table2) || die "Could not open file $table2 $!";
my (@position, @A, @a, @B, @b, @C, @c, @D, @d); #instead of using hashes I was trying to put all the values in arrays, because I don't know how to hash multiple columns from a file.
while(<$frequency>){
my @column = split(/\t/); # split on tabs
$position[$_] .= "$column[1] "; # I want to assign the correct column values to the arrays
$Afor[$_] .= "$column[2] ";
$arev[$_] .= "$column[3] ";
$Bfor[$_] .= "$column[4] ";
$brev[$_] .= "$column[5] ";
$Cfor[$_] .= "$column[6] ";
$crev[$_] .= "$column[7] ";
$Dfor[$_] .= "$column[8] ";
$drev[$_] .= "$column[9] ";
}
while(<$variants>){
next if /^\s*#/; # skipping some lines
next if /^\s*"/;
chomp;
my ($chr, $pos, $refall, $altall) = split;
}
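I am not sure this is the right approach, because I cannot figure out how to look up the correct row and the corresponding columns in $frequency. Can anyone help?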
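Answer 0 (score: 3)
The most important first step is usually to choose the right data structure to hold the data. I think the simplest structure for the contents of the frequency file is an array of hashes: the array index is the position, and each hash maps a column letter to its value. Like this: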
use strict; use warnings;
use English '-no_match_vars';
my ($variants_file, $frequency_file) = @ARGV; # take filename from command line
open my $variants, '<', $variants_file or die "Could not open file $variants_file: $!";
open my $frequency, '<', $frequency_file or die "Could not open file $frequency_file: $!";
# parse the header fields
my (undef, undef, @header) = do {
my $header_line = <$frequency>;
chomp $header_line;
split /\t/, $header_line;
};
my @frequency_data;
my $expect_pos = 1; # starting position
while (<$frequency>){
chomp;
my(undef, $pos, @column) = split /\t/; # split on tabs
unless ($pos == $expect_pos) {
die "On line $INPUT_LINE_NUMBER: expected data for position $expect_pos, instead found position $pos";
}
@{ $frequency_data[$pos] }{@header} = @column;
++$expect_pos;
}
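To make the hash-slice assignment concrete, here is a minimal sketch that stores a single row of the sample data and reads one letter pair back. The row is hard-coded purely for illustration (an assumption, not part of the original script); the variable names mirror the code above.

```perl
use strict;
use warnings;

# Header letters as they would be parsed from the frequency file.
my @header = qw(A a B b C c D d);

# One data row from the sample (position 4), hard-coded for illustration.
my ($pos, @column) = (4, 1, 0, 9, 4, 0, 2, 0, 0);

my @frequency_data;
# Hash slice: assign every column value to its header letter in one step.
@{ $frequency_data[$pos] }{@header} = @column;

# Read back the uppercase/lowercase pair for letter B at position 4.
my ($upper, $lower) = @{ $frequency_data[$pos] }{ 'B', 'b' };
print "B=$upper b=$lower\n";    # prints "B=9 b=4"
```

Because the position is used directly as the array index, a lookup costs a single array access plus a hash access, with no searching.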
The frequency data can then easily be accessed by position and letter:
<$variants>; # throw away header
while(<$variants>){
next if /^\s*[#\"]/; # skipping some lines
chomp;
my ($text, $pos, $refall, $altall) = split;
my @ref_data = @{ $frequency_data[$pos] }{$refall, lc($refall)};
my @alt_data = @{ $frequency_data[$pos] }{$altall, lc($altall)};
print join("\t", $text, $pos, $refall, $altall, @ref_data, @alt_data), "\n"; # keep C3 and C4 in the output, as in the expected result
}
Following your latest edit to the question (multiple letters in a column of $variants), the snippet above can be generalized to:
<$variants>;
while (<$variants>) {
next if ...
chomp;
my ($text, $pos, @cols) = split /\t/;
my @data = map {@{ $frequency_data[$pos] }{$_, lc $_}} # column to values
map { split /,/ } @cols; # split cols at comma
print join("\t", $text, $pos, @cols, @data), "\n"; # tab-separated, one output row per variant
}
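As a rough end-to-end check, the following sketch runs the same lookup on the sample data from the question, inlined as arrays instead of being read from files. The helper name `values_for` and the inlined rows are assumptions made for this illustration only.

```perl
use strict;
use warnings;

# Sample rows from the question, inlined here instead of read from files.
my @header = qw(A a B b C c D d);
my @frequency_rows = (
    [ 'text', 1, 0, 1, 0, 0, 0,  0,  0, 0 ],
    [ 'text', 2, 1, 0, 5, 4, 0,  0,  0, 0 ],
    [ 'text', 3, 0, 0, 0, 0, 10, 11, 3, 6 ],
    [ 'text', 4, 1, 0, 9, 4, 0,  2,  0, 0 ],
    [ 'text', 5, 5, 3, 0, 0, 6,  7,  4, 0 ],
);
my @variant_rows = (
    [ 'text', 2, 'A', 'D' ],
    [ 'text', 4, 'B', 'C' ],
    [ 'text', 5, 'A', 'B,D' ],
);

# Build the array of hashes: index = position, key = column letter.
my @frequency_data;
for my $row (@frequency_rows) {
    my (undef, $pos, @column) = @$row;
    @{ $frequency_data[$pos] }{@header} = @column;
}

# Hypothetical helper: split comma-separated letters, then return the
# uppercase/lowercase value pair for each letter at the given position.
sub values_for {
    my ($pos, @cols) = @_;
    return map { @{ $frequency_data[$pos] }{ $_, lc $_ } }
           map { split /,/ } @cols;
}

my @output;
for my $row (@variant_rows) {
    my ($text, $pos, @cols) = @$row;
    push @output, join("\t", $text, $pos, @cols, values_for($pos, @cols));
}
print "$_\n" for @output;
```

For the sample data the three rows carry the values 1 0 0 0, then 9 4 0 2, then 5 3 0 0 4 0, which matches the expected output in the question apart from the "empty" placeholders.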
I hope this helps.