Question

Input file1:

col1    col2    col3    col4
ZGLP1   ICAM4   13.27   0.2425
ICAM4   ZGLP1   13.27   0.2425
RRP1B   CDH24   20.8    1
ZGLP1   OOEP    18.79   0.3060
ZGLP1   RRP1B   39.62   0.2972
ZGLP1   CDH24   51.21   0.2560
BBCDI   DND1    19.44   0.2833
BBCDI   SOHLH2  36.61   0.2909
DND1    SOHLH2  18      0.8

Input file2:

chr8     18640000   18960000    ZGLP1   RRP1B   CDH24  #gene number here is not fixed can be #4 #5 or more
chr8     19000000   19080000    BBCDI   DND1    SOHLH2 #gene number here is not fixed can be #4 #5 or more

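I have written code that compares col1 and col2 of file1 against each line of file2. If any pair occurs anywhere on a line of file2, the program should print "chromosome pos1 pos2" followed by the matching row of file1.

Output file: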

chr8     18640000   18960000    ZGLP1   RRP1B 39.62 0.2972
chr8     18640000   18960000    ZGLP1 CDH24 51.21   0.2560
chr8     18640000   18960000    RRP1B CDH24 20.8    1
chr8     19000000   19080000    BBCDI   DND1 19.44  0.2833
chr8     19000000   19080000    BBCDI SOHLH2 36.61  0.2909
chr8     19000000   19080000    DND1 SOHLH2 18 0.8

This is what I have tried so far, but because my input files are huge (2 GB), it takes a very long time.

My Perl code:

open( AB, "file1" ) || die("cannot open");
open( BC, "file2" ) || die("cannot open");
open( OUT, ">output.txt" );

@file = <AB>;

chomp(@file);
@data = <BC>;

chomp(@data);

foreach $fl (@file) {
    if ( $fl =~ /(.*?)\s+(.*?)\s+(.*?)\s+(.*)/ ) {
        $one = $1;
        $two = $2;
        $thr = $3;
        $for = $4;
    }

    foreach $line (@data) {
        if ( $line =~ /(.*?)\s+(.*?)\s+(.*?)\s+(.*)+/ ) {
            $chr  = $1;
            $pos1 = $2;
            $pos2 = $3;
        }

        if ( $line =~ /$one/ ) {
            if ( $line =~ /$two/ ) {
                print OUT $chr, "\t", $pos1, "\t", $pos2, "\t", $fl, "\n";
            }
        }
    }
}

Answer 1

A few ways to speed up your code:

First, read in and parse file 1 and build an index:

my %ix;
while (<AB>) {
    # skip the first line (with the column headers)
    next if $. == 1;
    chomp;
    # assuming that the data is tab-separated; if not, you can run split /\s+/
    my @arr = split "\t";
    # create a hash with structure $ix{col1}{col2} = "col3  col4"
    $ix{ $arr[0] }{ $arr[1] } = $arr[2] . "\t" . $arr[3];
}

Now read file 2 one line at a time and look for matches:

while (<BC>) {
    chomp;
    # initialise a set of variables all at once
    # assumes the data is tab-delimited; if it isn't, use split /\s+/
    my ($chr, $pos1, $pos2, $g1, $g2, $g3) = split "\t";

    # $g1, $g2, and $g3 are the three IDs on the line. This code assumes exactly
    # three IDs per line, always in the order that they appear in file 1.
    # Look for $g1 in our index. if ( $ix{$g1} ) tests whether the value is true,
    # i.e. defined and not 0, "0", or the empty string.
    if ( $ix{$g1} ) {
        # now, for each of $g2 and $g3
        foreach my $g ($g2, $g3) {
            # ... check whether we've got an index entry where it is the second key
            if ( $ix{$g1}{$g} ) {
                # print out the data joined by tabs
                print OUT join("\t", $chr, $pos1, $pos2, $g1, $g, $ix{$g1}{$g}) . "\n";
            }
        }
    }
    # do the same check for $g2 and $g3. We have to check whether $ix{$g2} exists
    # first as if we check $ix{$g2}{$g3} directly and $ix{$g2} DOESN'T exist,
    # Perl will create it. This is known as autovivification.
    if ($ix{$g2} && $ix{$g2}{$g3}) {
        print OUT join("\t", $chr, $pos1, $pos2, $g2, $g3, $ix{$g2}{$g3}) . "\n";
    }
}

Answer 2

$ cat tst.awk
NR==FNR {
    if (NR>1)
        file1[$1,$2] = $0
    next
}
{
    for (i=4; i<=NF; i++)    # gene names start at field 4
        for (j=4; j<=NF; j++)
            if ( ($i,$j) in file1 )
                print $1, $2, $3, file1[$i,$j]
}
$ 
$ awk -f tst.awk file1 file2
chr8 18640000 18960000 ZGLP1   RRP1B   39.62   0.2972
chr8 18640000 18960000 ZGLP1   CDH24   51.21   0.2560
chr8 18640000 18960000 RRP1B   CDH24   20.8    1
chr8 19000000 19080000 BBCDI   DND1    19.44   0.2833
chr8 19000000 19080000 BBCDI   SOHLH2  36.61   0.2909
chr8 19000000 19080000 DND1    SOHLH2  18      0.8

How to speed up pattern matching between two files