How to speed up pattern matching between two files

Time: 2014-09-15 08:46:54

Tags: regex perl awk

Input file1:

col1    col2    col3    col4
ZGLP1   ICAM4   13.27   0.2425
ICAM4   ZGLP1   13.27   0.2425
RRP1B   CDH24   20.8    1
ZGLP1   OOEP    18.79   0.3060
ZGLP1   RRP1B   39.62   0.2972
ZGLP1   CDH24   51.21   0.2560
BBCDI   DND1    19.44   0.2833
BBCDI   SOHLH2  36.61   0.2909
DND1    SOHLH2  18      0.8

Input file2:

chr8     18640000   18960000    ZGLP1   RRP1B   CDH24  #gene number here is not fixed can be #4 #5 or more
chr8     19000000   19080000    BBCDI   DND1    SOHLH2 #gene number here is not fixed can be #4 #5 or more

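I wrote code that compares col1 and col2 of file1 against every line of file2. If both members of a pair occur anywhere on a file2 line, the program should print the chromosome, pos1, pos2 and the matching file1 row.

Output file: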

chr8     18640000   18960000    ZGLP1   RRP1B   39.62   0.2972
chr8     18640000   18960000    ZGLP1   CDH24   51.21   0.2560
chr8     18640000   18960000    RRP1B   CDH24   20.8    1
chr8     19000000   19080000    BBCDI   DND1    19.44   0.2833
chr8     19000000   19080000    BBCDI   SOHLH2  36.61   0.2909
chr8     19000000   19080000    DND1    SOHLH2  18      0.8

This is what I have tried so far, but because my input files are huge (2 GB) it takes a very long time.

My Perl code:

open( AB, "file1" ) || die("cannot open");
open( BC, "file2" ) || die("cannot open");
open( OUT, ">output.txt" );

@file = <AB>;

chomp(@file);
@data = <BC>;

chomp(@data);

foreach $fl (@file) {
    if ( $fl =~ /(.*?)\s+(.*?)\s+(.*?)\s+(.*)/ ) {
        $one = $1;
        $two = $2;
        $thr = $3;
        $for = $4;
    }

    foreach $line (@data) {
        if ( $line =~ /(.*?)\s+(.*?)\s+(.*?)\s+(.*)+/ ) {
            $chr  = $1;
            $pos1 = $2;
            $pos2 = $3;
        }

        if ( $line =~ /$one/ ) {
            if ( $line =~ /$two/ ) {
                print OUT $chr, "\t", $pos1, "\t", $pos2, "\t", $fl, "\n";
            }
        }
    }
}

2 answers:

Answer 0 (score: 1)

A couple of ways to speed up your code:

First, read in and parse file 1 and build an index:

my %ix;
while (<AB>) {
    # skip the first line (with the column headers)
    next if $. == 1;
    chomp;
    # assuming that the data is tab-separated; if not, you can run split /\s+/
    my @arr = split "\t";
    # create a hash with structure $ix{col1}{col2} = "col3  col4"
    $ix{ $arr[0] }{ $arr[1] } = $arr[2] . "\t" . $arr[3];
}

Now read file 2, one line at a time, and look for matches:

while (<BC>) {
    chomp;
    # initialise a set of variables all at once
    # assumes the data is tab-delimited; if it isn't, use split /\s+/
    my ($chr, $pos1, $pos2, $g1, $g2, $g3) = split "\t";

    # $g1, $g2, and $g3 are the first three gene IDs on the line (any further
    # genes on the line are ignored). This code assumes they always appear in
    # the same order as in file 1.
    # Look for $g1 in our index. if ( $ix{$g1} ) tests the value for truth;
    # here it is either undef or a hash reference, and a reference is always true.
    if ( $ix{$g1} ) {
        # now, for each of $g2 and $g3
        foreach my $g ($g2, $g3) {
            # ... check whether we've got an index entry where it is the second key
            if ( $ix{$g1}{$g} ) {
                # print out the data joined by tabs
                print OUT join("\t", $chr, $pos1, $pos2, $g1, $g, $ix{$g1}{$g}) . "\n";
            }
        }
    }
    # do the same check for $g2 and $g3. We have to check whether $ix{$g2} exists
    # first as if we check $ix{$g2}{$g3} directly and $ix{$g2} DOESN'T exist,
    # Perl will create it. This is known as autovivification.
    if ($ix{$g2} && $ix{$g2}{$g3}) {
        print OUT join("\t", $chr, $pos1, $pos2, $g2, $g3, $ix{$g2}{$g3}) . "\n";
    }
}
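
The loop above handles exactly three genes per line, whereas the question says the number of genes in file 2 is not fixed. Below is a minimal sketch of one way to generalise it, not a definitive implementation. It assumes whitespace-separated columns and a single header line in file 1, and it prints to standard output. `build_index` and `match_line` are hypothetical helper names that do not appear in the original answer. It keys a flat hash on the col1/col2 pair and tests it with `exists`, which sidesteps autovivification entirely.

```perl
use strict;
use warnings;

# Hypothetical helper: index file1 by "col1<TAB>col2".
# The value is the first four columns of the row, re-joined with tabs.
sub build_index {
    my ($fh) = @_;
    my %ix;
    my $header = <$fh>;               # assumption: first line is a header
    while ( my $row = <$fh> ) {
        chomp $row;
        my @col = split ' ', $row;    # split on any run of whitespace
        next if @col < 4;             # skip blank or short lines
        $ix{"$col[0]\t$col[1]"} = join "\t", @col[ 0 .. 3 ];
    }
    return \%ix;
}

# Hypothetical helper: for one file2 line, return an output line for every
# ordered pair of genes that occurs as (col1, col2) in the index.
sub match_line {
    my ( $ix, $line ) = @_;
    my ( $chr, $pos1, $pos2, @genes ) = split ' ', $line;
    my @out;
    for my $g1 (@genes) {
        for my $g2 (@genes) {
            my $key = "$g1\t$g2";
            # exists on a flat hash never creates new entries
            push @out, join( "\t", $chr, $pos1, $pos2, $ix->{$key} )
                if exists $ix->{$key};
        }
    }
    return @out;
}

# Run as a script only when two file names are given on the command line.
if ( @ARGV == 2 ) {
    my ( $file1, $file2 ) = @ARGV;
    open my $in1, '<', $file1 or die "cannot open $file1: $!";
    my $ix = build_index($in1);
    close $in1;

    open my $in2, '<', $file2 or die "cannot open $file2: $!";
    while ( my $line = <$in2> ) {
        chomp $line;
        print "$_\n" for match_line( $ix, $line );
    }
    close $in2;
}
```

Saved under a name of your choice (say `match.pl`), it could be run as `perl match.pl file1 file2 > output.txt`. Only file 1 is held in memory; file 2 is streamed line by line, and the work per line grows with the square of the number of genes on that line rather than with the size of file 1.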

Answer 1 (score: 1)

$ cat tst.awk               
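NR==FNR {
    # first file: index every data row (header skipped) by its col1,col2 pair
    if (NR>1)
        file1[$1,$2] = $0
    next
}
{
    # second file: genes start at field 4; try every ordered pair of them
    for (i=4; i<=NF; i++)
        for (j=4; j<=NF; j++)
            if ( ($i,$j) in file1 )
                print $1, $2, $3, file1[$i,$j]
}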
$ 
$ awk -f tst.awk file1 file2
chr8 18640000 18960000 ZGLP1   RRP1B   39.62   0.2972
chr8 18640000 18960000 ZGLP1   CDH24   51.21   0.2560
chr8 18640000 18960000 RRP1B   CDH24   20.8    1
chr8 19000000 19080000 BBCDI   DND1    19.44   0.2833
chr8 19000000 19080000 BBCDI   SOHLH2  36.61   0.2909
chr8 19000000 19080000 DND1    SOHLH2  18      0.8