Question

I have a very large file of start and end positions; here is a snippet:

(A)   11897   11976           
(B)   17024   18924         
(C)   25687  25709

and another file that also has start and end positions (again, just a snippet):

(i) 3631 5899  
(ii) 11649 13714                                       
(iii) 23146 31227

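I want to find out whether the ranges in file 2 contain the start and end positions given in file 1.

The result file I want looks like this: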

(ii) 11649 13714 (A) 11897 11976
(iii) 23146 31227 (C) 25687 25709

I wrote this Perl code:

open my $firstfile, '<', $ARGV[0] or die "$!";
open my $secondfile, '<', $ARGV[1] or die "$!";

while (<$firstfile>) {
    @col=split /\s+/;
    $start=$col[1];
    $end= $col[2];

    while (<$secondfile>) {
        @seccol=split /\s+/;
        $begin=$seccol[1];
        $finish=$seccol[2];     

        print join ("\t", @col, @seccol), "\n" if ($start>=$begin and $end<=$finish);
    }
}

But my result file shows only the first match and none of the others:

(ii) 11649 13714 (A) 11897 11976

Any suggestions?

Answer 1

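Because you are using nested loops, the second file is read all the way to the end during the first iteration of the outer loop, so every later iteration finds nothing left to read. Instead of re-reading the file, you can build an array from the lines of the first file and compare each line of the second file against it: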

use strict;
use warnings;
use autodie;

open my $firstfile, '<', $ARGV[0];
open my $secondfile, '<', $ARGV[1];

my @range;

while (<$firstfile>) {
    push @range, [ split ];
}

while (<$secondfile>) {
    my @col = split;
    my @matches = grep {
        $$_[1] >= $col[1] && $$_[2] <= $col[2]
    } @range;

    if (@matches > 0) {
        for my $ref (@matches) {
            print join("\t", @$ref, @col), "\n";
        }
    }
}

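@range is an array of references, one per line of the first file, each pointing to an array of that line's columns. Note that you do not need to pass any arguments to split: by default it splits $_ on whitespace.

In the second while loop, the columns of each line of the second file are compared against every set of values referenced in @range. Any matches are stored in @matches. If that array has more than 0 elements, each match is printed in the same format as originally specified.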
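The exhaustion problem and the array-based fix can be sketched on the sample data from the question. This is only an illustration, not the answerer's code: the in-memory filehandle and the helper names count_remaining and contained_pairs are assumptions made so the example runs without any files, and it prints the file 2 columns first, as in the desired output.

```perl
use strict;
use warnings;

# Sample data from the question, held in strings so no files are needed.
my $file1 = "(A) 11897 11976\n(B) 17024 18924\n(C) 25687 25709\n";
my $file2 = "(i) 3631 5899\n(ii) 11649 13714\n(iii) 23146 31227\n";

# Hypothetical helper: count the lines still unread on a filehandle.
sub count_remaining {
    my ($fh) = @_;
    my @lines = <$fh>;
    return scalar @lines;
}

# A handle that has reached end-of-file yields nothing on a second read.
open my $fh, '<', \$file2 or die "Cannot open in-memory handle: $!";
my $first_pass  = count_remaining($fh);   # all three lines
my $second_pass = count_remaining($fh);   # nothing left
close $fh;
print "first pass: $first_pass, second pass: $second_pass\n";

# Hypothetical helper: cache the file 1 ranges in an array, then test
# each file 2 range against all of them.
sub contained_pairs {
    my ($inner_text, $outer_text) = @_;
    my @inner = map { [ split ' ', $_ ] } split /\n/, $inner_text;
    my @pairs;
    for my $line (split /\n/, $outer_text) {
        my @outer = split ' ', $line;
        for my $in (@inner) {
            push @pairs, join("\t", @outer, @$in)
                if $in->[1] >= $outer[1] && $in->[2] <= $outer[2];
        }
    }
    return @pairs;
}

print "$_\n" for contained_pairs($file1, $file2);
```

With the sample data this should report (A) inside (ii) and (C) inside (iii); (B) fits in none of the three ranges.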

Answer 2

You need to rewind the second file on every pass or, preferably (depending on its size), load it into an array.

#!/usr/bin/perl
use strict;
use warnings;

my ($start,$end,$begin,$finish);

open my $firstfile, '<', $ARGV[0] or die "$!";
open my $secondfile, '<', $ARGV[1] or die "$!";

while (<$firstfile>) {
        my @col=split /\s+/;
        $start=$col[1];
        $end= $col[2];

        seek($secondfile,0,0);
        while (<$secondfile>) {
           my @seccol=split /\s+/;
           $begin=$seccol[1];
           $finish=$seccol[2];
           print join ("\t", @col, @seccol), "\n" if ($start>=$begin and $end<=$finish);
        }
}
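The code above takes the rewind route with seek. The other option mentioned, loading the second file into an array, might look like the following minimal sketch. The sub names load_ranges and find_contained are hypothetical, and the check for at least three fields is an added guard that the original code does not have.

```perl
use strict;
use warnings;

# Hypothetical helper: read a whitespace-separated range file into an
# array of [label, start, end] rows.
sub load_ranges {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my @ranges;
    while (my $line = <$fh>) {
        my @fields = split ' ', $line;
        next unless @fields >= 3;        # skip blank or malformed lines
        push @ranges, \@fields;
    }
    close $fh;
    return \@ranges;
}

# Hypothetical helper: stream the first file once and compare each of its
# ranges against the cached rows of the second file.
sub find_contained {
    my ($first_path, $second_path) = @_;
    my $outer = load_ranges($second_path);
    open my $fh, '<', $first_path or die "Cannot open $first_path: $!";
    my @hits;
    while (my $line = <$fh>) {
        my @col = split ' ', $line;
        next unless @col >= 3;
        for my $o (@$outer) {
            push @hits, join("\t", @col, @$o)
                if $col[1] >= $o->[1] && $col[2] <= $o->[2];
        }
    }
    close $fh;
    return @hits;
}

# Run only when two file names are given on the command line.
if (@ARGV == 2) {
    print "$_\n" for find_contained(@ARGV);
}
```

This reads the second file from disk only once, at the cost of holding it in memory, so it suits the case where the second file is the smaller of the two.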

Answer 3

Here is another approach, a Perl one-liner:

perl -lane '
BEGIN { 
    $x = pop;
    push @range, map[split], <>; 
    @ARGV = $x
}  
for (@range) {
    if ($F[1] <= $_->[1] && $F[2] >= $_->[2]) {
        print join " ", @F, @$_
    }
}' bigfile secondfile
(ii) 11649 13714 (A) 11897 11976
(iii) 23146 31227 (C) 25687 25709

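The command-line options used:

-l strips the newline from each line and adds it back when printing
-a automatically splits each line into the array @F
-n wraps the code in a while(<>){..} loop that processes each line
-e executes the code that follows
In the BEGIN block we read through the big file and build an array of arrays.
In the main body we check whether each stored range's start and end fall between the second and third columns of the current line; if so, we print the current line followed by the contents of the matching inner array.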
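For readers less used to the switches, the one-liner can be unrolled into an ordinary script. The version below is a rough equivalent, not an exact expansion of what perl generates: the sub name match_ranges is hypothetical, and the checks for at least three fields are added guards against blank lines.

```perl
use strict;
use warnings;

# Hypothetical long-form equivalent of the one-liner.
sub match_ranges {
    my ($big_path, $second_path) = @_;

    # BEGIN block: slurp the big file into an array of arrays.
    open my $big, '<', $big_path or die "Cannot open $big_path: $!";
    my @range = grep { @$_ >= 3 } map { [ split ' ', $_ ] } <$big>;
    close $big;

    # -n: loop over every line of the remaining input file.
    open my $second, '<', $second_path or die "Cannot open $second_path: $!";
    my @out;
    while (my $line = <$second>) {
        chomp $line;                     # -l: strip the trailing newline
        my @F = split ' ', $line;        # -a: autosplit into @F
        next unless @F >= 3;
        for my $r (@range) {
            push @out, join(' ', @F, @$r)
                if $F[1] <= $r->[1] && $F[2] >= $r->[2];
        }
    }
    close $second;
    return @out;
}

# -l also re-adds the newline on print; here it is written explicitly.
if (@ARGV == 2) {
    print "$_\n" for match_ranges(@ARGV);
}
```

Invoked as `perl script.pl bigfile secondfile` (script.pl being a placeholder name), it should print the same two lines shown after the one-liner.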

perl: matching numeric values against ranges in two files
