Question

I need your help again!

I have a tab-delimited file like this:

chr10.10.2      scaffold1116    94.92   394     13      1       16      409     10474   10860   4.1e-201        697.0
chr10.10.2      scaffold1116    100.00  14      0       0       1       14      10453      10466   1.9e+01 27.0
…………………………

and another file like this:

chr10.10.1      283
chr10.10.2      409
chr10.10.3      572
chr10.10.4      248
chr10.10.5      143
…………………………

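I want to keep certain lines of the first file based on the numbers in the second file.

For example, to decide whether to keep a line with "chr10.10.2", I have to look up "chr10.10.2" in the second file. I wrote a script, but because both files are very large it takes a long time (for every line of the first file, it searches all the lines of the second file). Is there a more efficient way to search the second file?

Here is my code: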

#!/usr/bin/perl
use strict;
use warnings;

my $blat_out = $ARGV[0];
my $sizes    = $ARGV[1];

#Checking the output of "HCEs Vs Genomes" alignments (blat) based on the sizes of the HCEs....

open my $blat_file, $blat_out or die "Could not open $blat_out: $!";
while ( my $line = <$blat_file> ) {
    chomp $line;
    # while( my $size_line = <$size_file>)  {
    if ( $line =~ m/^chr/ ) {
        my @lines = split( '\t', $line );
        #my @size_lines = split('\t', $size_line);
        my $hce        = $lines[0];
        #print "$hce\n";
        my $scaf       = $lines[1];
        my $persent    = $lines[2];
        my $al_length  = $lines[3];
        my $hce_start  = $lines[6];
        my $hce_end    = $lines[7];
        my $scaf_start = $lines[8];
        my $scaf_end   = $lines[9];
        my $score      = $lines[10];
        open my $size_file, $sizes or die "Could not open $sizes: $!";

        while ( my $size_line = <$size_file> ) {
            chomp $size_line;
            my @size_lines = split( '\t', $size_line );
            my $hce_name   = $size_lines[0];
            my $hce_size   = $size_lines[1];
            #print "$hce_size\n";

            if ( $hce eq $hce_name ) {
                my $al_ratio = $al_length / $hce_size;
                if ( ( $persent >= 98 ) && ( $al_ratio >= 0.9 ) ) {
                    print "$line\n";    #print only the lines that satisfies the previous criteria
                }

            }
        }
        #close $size_file;
    }
}

Thank you very much in advance, Vasilis.

Answer 1

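I suggest storing $size_file in memory (in a hash), so that you don't need to open it for every line of $blat_file. That I/O is heavy.

You can write your own code to do this, or use the File::Slurp module.

Bonus: you can also speed up parsing with the Text::CSV_XS module, using a tab as the separator instead of a comma.

Also, this is unrelated, but FYI, you can convert these lines: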

my $hce        = $lines[0];
my $scaf       = $lines[1];
my $persent    = $lines[2];
my $al_length  = $lines[3];
my $hce_start  = $lines[6];
my $hce_end    = $lines[7];
my $scaf_start = $lines[8];
my $scaf_end   = $lines[9];
my $score      = $lines[10];

into:

my ($hce, $scaf, $persent, $al_length, undef, undef, $hce_start, $hce_end, $scaf_start, $scaf_end, $score) = @lines;
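An array slice is another way to write the same assignment, naming only the column indices that are actually used. A minimal sketch, using the first sample row from the question as the contents of @lines:

```perl
use strict;
use warnings;

# First sample row from the question, joined with tabs and split again
# the same way the script does.
my $line  = join "\t", qw(chr10.10.2 scaffold1116 94.92 394 13 1 16 409 10474 10860 4.1e-201 697.0);
my @lines = split /\t/, $line;

# Take columns 0-3 and 6-10 in one step; columns 4 and 5 are never copied.
my ($hce, $scaf, $persent, $al_length,
    $hce_start, $hce_end, $scaf_start, $scaf_end, $score) = @lines[0 .. 3, 6 .. 10];

print "$hce $al_length $hce_start-$hce_end $score\n";
```

The undef placeholders and the slice behave the same here; the slice tends to be easier to keep correct when the skipped columns are far apart.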

Answer 2

How about using a hash to store the second file:

# Build hash of hce_name => hce_size
my %size = do {
    open my $fh, '<', $sizes or die "Could not open $sizes: $!";
    map { chomp; split "\t", $_, 2 } <$fh>;
};

open my $blat_file, '<', $blat_out or die "Could not open $blat_out: $!";
while ( my $line = <$blat_file> ) {
    chomp $line;

    next if $line !~ m/^chr/;

    my @fields     = split "\t", $line;
    my $hce        = $fields[0];
    my $scaf       = $fields[1];
    my $persent    = $fields[2];
    my $al_length  = $fields[3];
    my $hce_start  = $fields[6];
    my $hce_end    = $fields[7];
    my $scaf_start = $fields[8];
    my $scaf_end   = $fields[9];
    my $score      = $fields[10];

    next if !exists $size{$hce};

    my $al_ratio = $al_length / $size{$hce};
    if ( $persent >= 98 && $al_ratio >= 0.9 ) {
        print "$line\n";    #print only the lines that satisfies the previous criteria
    }
}
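To try the hash lookup without the real files, the loop above can be wrapped in a sub and fed in-memory filehandles. This is only a sketch: filter_hits is a hypothetical helper name, the rows are made up, and the extra truthiness check on the size (which also skips a size of 0 to avoid dividing by zero) is an assumption rather than part of the code above.

```perl
use strict;
use warnings;

# Return the lines read from $blat_fh that pass both thresholds.
# $size is a hash ref of hce_name => hce_size.
sub filter_hits {
    my ($blat_fh, $size) = @_;
    my @kept;
    while ( my $line = <$blat_fh> ) {
        chomp $line;
        next if $line !~ m/^chr/;
        my ($hce, $persent, $al_length) = ( split /\t/, $line )[0, 2, 3];
        next if !$size->{$hce};    # name not in the size file, or size 0
        push @kept, $line
            if $persent >= 98 && $al_length / $size->{$hce} >= 0.9;
    }
    return @kept;
}

# Made-up rows: only the first one passes both thresholds.
my @rows = (
    [ qw(chr1.1.1 scafA 99.00 95  1 0 1 95  100 194 1e-50 180.0) ],
    [ qw(chr1.1.1 scafB 99.00 50  0 0 1 50  300 349 1e-20 90.0)  ],
    [ qw(chr1.1.2 scafA 97.50 200 5 0 1 200 500 699 1e-90 350.0) ],
);
my $blat = '';
$blat .= join( "\t", @$_ ) . "\n" for @rows;

my %size = ( 'chr1.1.1' => 100, 'chr1.1.2' => 200 );

# Open the string as if it were a file.
open my $fh, '<', \$blat or die "Could not open in-memory file: $!";
my @kept = filter_hits( $fh, \%size );
print "$_\n" for @kept;
```

Each line of the first file now costs one hash lookup instead of a full pass over the second file.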

Answer 3

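If both files are very large, don't use a hash table. Use sorting.

First, sort both files by the first column. Setting LC_ALL=C makes sort use plain byte order, which is the order Perl's lt and gt operators compare by:

$ LC_ALL=C sort -k 1,1 first.tsv > first.sorted
$ LC_ALL=C sort -k 1,1 second.tsv > second.sorted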

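Then walk through the first and second files line by line, looking for matches between the two.

When there is a match, print it; otherwise, advance through the first or the second file, depending on the result of the string comparison: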

#!/usr/bin/perl

use strict;
use warnings;

my $firstFn = "first.sorted";
my $secondFn = "second.sorted";
open my $firstFh, "<", $firstFn or die "could not open first file: $!\n";
open my $secondFh, "<", $secondFn or die "could not open second file: $!\n";

# Read the next line of the first file. Returns (line, first column),
# or an empty list once the file is exhausted.
sub nextFirst {
    my $line = <$firstFh>;
    return unless defined $line;
    chomp $line;
    my ($chr) = split("\t", $line);
    return ($line, $chr);
}

my ($firstLine, $firstChr) = nextFirst();
while (<$secondFh>) {
    last unless defined $firstLine;    # nothing left in the first file
    chomp;
    my ($secondChr, $secondNum) = split("\t", $_);

    #
    # Test *chr string equality:
    #
    #  1. If secondChr is less than ("lt") firstChr, neither loop
    #     below runs and we retrieve the next secondChr.
    #
    #  2. While secondChr is greater than ("gt") firstChr, we
    #     retrieve lines from the first file until it catches up.
    #
    #  3. While secondChr is the same as ("eq") firstChr, we print
    #     the first file's current line and retrieve its next line.
    #
    # Both loops also stop at the end of the first file; without that
    # check, an undefined firstChr would keep the "gt" loop running
    # forever.
    #

    while (defined $firstLine && $secondChr gt $firstChr) {
        ($firstLine, $firstChr) = nextFirst();
    }
    while (defined $firstLine && $secondChr eq $firstChr) {
        print STDOUT "$firstLine\n";
        ($firstLine, $firstChr) = nextFirst();
    }
}
close $secondFh;
close $firstFh;

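This is untested, but I think it should work (or at least the explanation should get you close).

The advantage of this approach over a hash table is that you only need enough memory to hold two lines, one from each file. Unless your lines are very long, memory overhead is hardly an issue any more. If you have very large files, this can be a significant advantage.

The downside is the up-front time cost of sorting two (large) files. However, if one of the files doesn't change and you often do lookups between the two files, some of the sorting time is amortized quickly.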
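The merge loop above prints every line whose name appears in both files; it does not apply the identity and length thresholds from the question. A hedged sketch of how the two could be combined follows. The helper names next_record and merge_filter are hypothetical, the sample rows are made up, and both inputs are assumed to be sorted in byte order on the first column.

```perl
use strict;
use warnings;

# Read one line from $fh; return (line, first column), or an empty list at EOF.
sub next_record {
    my ($fh) = @_;
    my $line = <$fh>;
    return unless defined $line;
    chomp $line;
    my ($key) = split /\t/, $line;
    $key = '' unless defined $key;    # blank line: treat the key as empty
    return ( $line, $key );
}

# Walk both sorted inputs in step and return the hits that pass the
# question's thresholds (identity >= 98, aligned length / size >= 0.9).
sub merge_filter {
    my ($blat_fh, $size_fh) = @_;
    my @kept;
    my ($hit,       $hit_key)  = next_record($blat_fh);
    my ($size_line, $size_key) = next_record($size_fh);
    while ( defined $hit && defined $size_line ) {
        if ( $hit_key lt $size_key ) {       # no size entry for this hit
            ($hit, $hit_key) = next_record($blat_fh);
        }
        elsif ( $hit_key gt $size_key ) {    # size entry has no more hits
            ($size_line, $size_key) = next_record($size_fh);
        }
        else {                               # same name: apply the filter
            my ($persent, $al_length) = ( split /\t/, $hit )[2, 3];
            my $hce_size = ( split /\t/, $size_line )[1];
            push @kept, $hit
                if $hce_size && $persent >= 98 && $al_length / $hce_size >= 0.9;
            # Advance only the hits: several hits may share one name.
            ($hit, $hit_key) = next_record($blat_fh);
        }
    }
    return @kept;
}

# Made-up, already sorted sample data held in strings.
my $blat = join '', map { "$_\n" } (
    join( "\t", qw(chr1.1.1 scafA 99.00  95 1 0 1 95 100 194 1e-50 180.0) ),
    join( "\t", qw(chr1.1.1 scafB 99.00  50 0 0 1 50 300 349 1e-20 90.0) ),
    join( "\t", qw(chr1.1.3 scafC 100.00 48 0 0 1 48 700 747 1e-25 95.0) ),
);
my $sizes = "chr1.1.1\t100\nchr1.1.2\t200\nchr1.1.3\t50\n";

open my $blat_fh, '<', \$blat  or die "Could not open in-memory hits: $!";
open my $size_fh, '<', \$sizes or die "Could not open in-memory sizes: $!";
my @kept = merge_filter( $blat_fh, $size_fh );
print "$_\n" for @kept;
```

Only one line from each file is held in memory at a time, apart from the list of kept lines, which could be printed directly instead of collected.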
