Question

I have two txt files with multiple columns. This is what the first file ($frequency) looks like:

C1    C2    A  a   B   b   C   c   D   d
text   1    0  1   0   0   0   0   0   0
text   2    1  0   5   4   0   0   0   0
text   3    0  0   0   0   10  11  3   6
text   4    1  0   9   4   0   2   0   0
text   5    5  3   0   0   6   7   4   0

So C2 contains every position from 1 to 20000. Columns A through d contain numeric values that are all greater than or equal to 0.

This is what the second file ($variants) looks like:

C1    C2    C3   C4  
text   2    A    D  
text   4    B    C 
text   5    A    B,D

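Here C2 contains some of the values between 1 and 20000. C3 and C4 contain letters from A to D (like the column names in the first table, but always uppercase). What I want to do is: match the value in C2 of $variants against the values in C2 of $frequency, check which letter is in C3 of $variants, and then copy the corresponding values from $frequency into two new columns of $variants (so the correct row, and the columns for both the uppercase and the lowercase letter). The same then needs to be done for C4 of $variants.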

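Edit: sometimes C4 in $variants contains two letters separated by a ','. For both of these letters, the values from $frequency should appear in the output.

Based on this example, this is what the output should look like: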

C1    C2    C3    C4    C5   C6   C7   C8  C9  C10
text  2     A     D     1    0    0    0   empty  
text  4     B     C     9    4    0    2   empty
text  5     A     B,D   5    3    0    0    4   0

I have already started on the script, but I am stuck at the point where I need to compare the values and the letters.

This is what I have so far:

my $table1 = prompt("Give the name of the file with variants:\n");
open(my $variants, '<',$table1) || die "Could not open file $table1 $!";

my $table2 = prompt("Give the name of the file with the frequencies: \n");
open(my $frequency, '<',$table2) || die "Could not open file $table2 $!";

my (@position, @A, @a, @B, @b, @C, @c, @D, @d); #instead of using hashes I was trying to put all the values in arrays, because I don't know how to hash multiple columns from a file.

while(<$frequency>){
    my @column = split(/\t/); # split on tabs
    $position[$_] .= "$column[1] "; # I want to assign the correct column values to the arrays
    $Afor[$_] .= "$column[2] ";
    $arev[$_] .= "$column[3] ";
    $Bfor[$_] .= "$column[4] ";
    $brev[$_] .= "$column[5] ";
    $Cfor[$_] .= "$column[6] ";
    $crev[$_] .= "$column[7] ";
    $Dfor[$_] .= "$column[8] ";
    $drev[$_] .= "$column[9] ";
}

while(<$variants>){
    next if /^\s*#/; # skipping some lines
    next if /^\s*"/;
    chomp;
my ($chr, $pos, $refall, $altall) = split;
}

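I am not sure whether this is the right approach, because I cannot figure out how to look up the correct row and the corresponding columns in $frequency. Can anyone help me?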

Answer 1

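The most important first step is usually to pick the right data structure to hold your data. I think the simplest structure for the contents of the frequency file is an array of hashes: the array index is the position, and each hash maps a column letter to its value. Like this: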

use strict; use warnings;
use English '-no_match_vars';

my ($variants_file, $frequency_file) = @ARGV; # take filename from command line

open my $variants,  '<', $variants_file   or die "Could not open file $variants_file: $!";
open my $frequency, '<', $frequency_file  or die "Could not open file $frequency_file: $!";

# parse the header fields
my (undef, undef, @header) = do {
  my $header_line = <$frequency>;
  chomp $header_line;
  split /\t/, $header_line;
};

my @frequency_data;
my $expect_pos = 1; # starting position
while (<$frequency>){
    chomp;
    my(undef, $pos, @column) = split /\t/; # split on tabs
    unless ($pos == $expect_pos) {
      die "On line $INPUT_LINE_NUMBER: expected data for position $expect_pos, instead found position $pos";
    }
    @{ $frequency_data[$pos] }{@header} = @column;
    ++$expect_pos;
}
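
To make the resulting structure concrete, here is a minimal sketch of the same hash slice assignment applied to two rows of the sample data. The %sample_rows hash is only a stand-in for the file contents, an assumption made so that the snippet runs on its own:

```perl
use strict;
use warnings;

# Column names from the sample frequency header (everything after C1 and C2).
my @header = qw(A a B b C c D d);

# Two rows of the sample data, keyed by position (stand-in for the file).
my %sample_rows = (
    2 => [1, 0, 5, 4, 0, 0, 0, 0],
    5 => [5, 3, 0, 0, 6, 7, 4, 0],
);

my @frequency_data;
for my $pos (sort { $a <=> $b } keys %sample_rows) {
    # Hash slice assignment: autovivifies a hash ref at index $pos and
    # pairs each header name with the count from the same column.
    @{ $frequency_data[$pos] }{@header} = @{ $sample_rows{$pos} };
}

# Look up single values by position and letter.
print "B at position 2: $frequency_data[2]{B}\n";
print "b at position 2: $frequency_data[2]{b}\n";

# Or fetch the uppercase/lowercase pair in one go with a hash slice.
my @pair = @{ $frequency_data[5] }{ 'D', lc 'D' };
print "D/d at position 5: @pair\n";
```

Using the position directly as the array index keeps lookups trivial; index 0 simply stays unused because positions start at 1.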

The frequency data can then be accessed easily by position and letter:

<$variants>;  # throw away header
while(<$variants>){
    next if /^\s*[#\"]/; # skipping some lines
    chomp;
    my ($text, $pos, $refall, $altall) = split;
    my @ref_data = @{ $frequency_data[$pos] }{$refall, lc($refall)};
    my @alt_data = @{ $frequency_data[$pos] }{$altall, lc($altall)};
    print join("\t", $text, $pos, $refall, $altall, @ref_data, @alt_data), "\n";
}

Following your latest edit to the question (several letters in one column of $variants), the snippet above can be generalized to:

<$variants>;
while (<$variants>) {
  next if ...
  chomp;
  my ($text, $pos, @cols) = split /\t/;
  my @data = map {@{ $frequency_data[$pos] }{$_, lc $_}}  # column to values
             map { split /,/ } @cols;                     # split cols at comma
  print join("\t", $text, $pos, @cols, @data), "\n";
}
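
For reference, the pieces above can be combined into one self-contained sketch that runs against the sample tables from the question. It reads from in-memory strings instead of files, and the sub names load_frequencies and annotate_variants are hypothetical helpers introduced only for this illustration:

```perl
use strict;
use warnings;

# Tab-separated sample data matching the tables in the question.
my $frequency_text = join "\n", map { join("\t", @$_) } (
    [qw(C1 C2 A a B b C c D d)],
    [qw(text 1 0 1 0 0 0 0 0 0)],
    [qw(text 2 1 0 5 4 0 0 0 0)],
    [qw(text 3 0 0 0 0 10 11 3 6)],
    [qw(text 4 1 0 9 4 0 2 0 0)],
    [qw(text 5 5 3 0 0 6 7 4 0)],
);
my $variants_text = join "\n", map { join("\t", @$_) } (
    [qw(C1 C2 C3 C4)],
    [qw(text 2 A D)],
    [qw(text 4 B C)],
    ['text', 5, 'A', 'B,D'],
);

# Hypothetical helper: parse the frequency table into an array of hashes.
sub load_frequencies {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Could not open in-memory data: $!";
    my $header_line = <$fh>;
    chomp $header_line;
    my (undef, undef, @header) = split /\t/, $header_line;
    my @data;
    while (my $line = <$fh>) {
        chomp $line;
        my (undef, $pos, @counts) = split /\t/, $line;
        @{ $data[$pos] }{@header} = @counts;   # letter => count at this position
    }
    return \@data;
}

# Hypothetical helper: return one output row per variant line.
sub annotate_variants {
    my ($text, $freq) = @_;
    open my $fh, '<', \$text or die "Could not open in-memory data: $!";
    my $discarded_header = <$fh>;
    my @rows;
    while (my $line = <$fh>) {
        next if $line =~ /^\s*[#"]/;           # skip comment-like lines
        chomp $line;
        my ($label, $pos, @cols) = split /\t/, $line;
        my @counts = map { @{ $freq->[$pos] }{ $_, lc $_ } }  # letter to value pair
                     map { split /,/ } @cols;                 # "B,D" to (B, D)
        push @rows, join "\t", $label, $pos, @cols, @counts;
    }
    return @rows;
}

my $freq = load_frequencies($frequency_text);
print "$_\n" for annotate_variants($variants_text, $freq);
```

This prints one tab-separated row per variant. Rows with a single letter in C4 simply end after C8 rather than carrying "empty" placeholders; padding them would be a small extra step if a fixed column count is required.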

I hope this helps.
