Question

I need to find the matches between two tab-delimited files that look like this:

File 1:

ID1  1  65383896    65383896    G   C  PCNXL3
ID1  2  56788990        55678900        T       A  ACT1 
ID1  1   56788990       55678900       T       A  PRO55

File 2:

ID2 34    65383896   65383896       G   C  MET5
ID2  2   56788990       55678900       T       A  ACT1 
ID2  2   56788990       55678900       T       A  HLA

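What I would like to do is retrieve the lines that match between the two files. What I want to match on is everything after the gene ID.

So far I have written this code, but unfortunately Perl keeps giving me the error: "Use of uninitialized value in pattern match (m//)".

Could you help me figure out what I am doing wrong?

Thanks in advance!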

use strict;

open (INA, $ARGV[0]) || die "cannot to open gene file";
open (INB, $ARGV[1]) || die "cannot to open coding_annotated.var files";

my @sample1 = <INA>;
my @sample2 = <INB>;

foreach my $line (@sample1) {
    my @tab = split (/\t/, $line);

    my $chr   = $tab[1];
    my $start = $tab[2];
    my $end   = $tab[3];
    my $ref   = $tab[4];
    my $alt   = $tab[5];
    my $name  = $tab[6];

    foreach my $item (@sample2){
        my @fields = split (/\t/,$item);

        if (   $fields[1] =~ m/$chr(.*)/
            && $fields[2] =~ m/$start(.*)/
            && $fields[4] =~ m/$ref(.*)/
            && $fields[5] =~ m/$alt(.*)/
            && $fields[6] =~ m/$name(.*)/
        ) {     
            print  $line, "\n", $item;
        }
    }
}

Answer 1

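At first glance your code looks fine (although I did not debug it). If there is no error that I failed to spot, it could be that the input data contains regex special characters, which confuse the regular expression engine when you interpolate them as-is (for example, if any of the variables contains a '$' character). It could also be that in some places you have spaces instead of tabs, in which case you would indeed get an error, because your split will fail.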
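The warning itself most likely comes from a line that has fewer tab-separated fields than the code expects (for example a line that uses spaces, or an empty last line): split then returns a short list, and $fields[1] or $chr ends up undefined. Here is a minimal sketch of guarding against both problems. The names parse_record and contains_literal are made up for this illustration and are not part of your script.

```perl
use strict;
use warnings;

# Hypothetical helper: split a record on any run of whitespace and
# return its fields, or an empty list if the line has too few of them.
sub parse_record {
    my ($line, $expected) = @_;
    chomp $line;
    # split ' ' is the awk-style split: tabs or spaces, leading whitespace ignored
    my @fields = split ' ', $line;
    return @fields >= $expected ? @fields : ();
}

# Hypothetical helper: true if $text contains $literal, with any regex
# metacharacters in $literal treated as plain characters (\Q...\E).
sub contains_literal {
    my ($text, $literal) = @_;
    return $text =~ /\Q$literal\E/ ? 1 : 0;
}

my @fields = parse_record("ID1  1  65383896    65383896    G   C  PCNXL3\n", 7);
print "parsed ", scalar(@fields), " fields\n" if @fields;
```

With a guard like this, a malformed line is skipped instead of producing undefined values in the pattern match.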

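In any case, you are better off composing a single regular expression that contains all the fields. My code below is a little more idiomatic Perl. I like using the implicit $_, which in my opinion makes the code more readable. I just tested it with your input files, and it does the job.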

use strict;
use warnings;

open (INA, $ARGV[0]) or die "cannot open file 1";
open (INB, $ARGV[1]) or die "cannot open file 2";

my @sample1 = <INA>;
my @sample2 = <INB>;

foreach (@sample1) {
    # Capture the ID and the six fields that follow it.
    my ($id, $chr, $start, $end, $ref, $alt, $name) =
        m/^(ID\d+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)/
        or next;    # skip lines that do not have this shape
    # Build one regex that matches the same fields after any ID.
    my $rex = "^ID\\d+\\s+$chr\\s+$start\\s+$end\\s+$ref\\s+$alt\\s+$name\\s+";
    #print "$rex\n";
    foreach (@sample2) {
        if( m/$rex/ ) {
            print "$id - $_";
        }
    }
}

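Also, how regular is your input data? Is there exactly one tab between fields? If so, there is no point in splitting the lines into 7 different fields - you only need two: the ID portion of the line, and the rest. The first regex would be

(my $id, my $restOfLine) = m/^(ID\d+)\s+(.*)$/;

and you then search for $restOfLine in the second file, using a technique similar to the one above.

If your files are huge and performance is a concern, you should consider putting the first regexes (or strings) in a map. This will give you O(n*log(m)), where n and m are the number of lines in each file.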
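In Perl the natural map is a hash, whose lookups take constant time on average, so in practice this is one pass over each file. The following is only a sketch of that idea, not tested against your real data. The names key_for and matching_lines are invented for the example, and it assumes that runs of tabs or spaces between fields should be treated as equivalent.

```perl
use strict;
use warnings;

# Hypothetical helper: everything after the first field, with the
# whitespace between fields normalised to single tabs.
sub key_for {
    my ($line) = @_;
    my ($id, @rest) = split ' ', $line;
    return undef unless defined $id && @rest;
    return join "\t", @rest;
}

# Hypothetical helper: given two array refs of lines, return
# [line1, line2] pairs whose text after the ID is identical.
sub matching_lines {
    my ($lines1, $lines2) = @_;

    # Index file 1 by its key; keep every line that shares a key.
    my %by_key;
    for my $line (@$lines1) {
        my $key = key_for($line);
        next unless defined $key;
        push @{ $by_key{$key} }, $line;
    }

    # One pass over file 2, one hash lookup per line.
    my @pairs;
    for my $line (@$lines2) {
        my $key = key_for($line);
        next unless defined $key;
        push @pairs, map { [ $_, $line ] } @{ $by_key{$key} || [] };
    }
    return @pairs;
}

my @file1 = ("ID1\t1\t65383896\t65383896\tG\tC\tPCNXL3\n",
             "ID1\t2\t56788990\t55678900\tT\tA\tACT1\n");
my @file2 = ("ID2\t34\t65383896\t65383896\tG\tC\tMET5\n",
             "ID2\t2\t56788990\t55678900\tT\tA\tACT1\n");
print @$_ for matching_lines(\@file1, \@file2);
```

Storing a list of lines per key means that repeated entries in the first file are all reported, rather than only the last one seen.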

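Finally, I faced a similar challenge when I needed to compare logs. The logs were supposed to be identical, except for a time stamp at the beginning of each line. More importantly, most of the lines were the same and in the same order. If that is what you have, and it makes sense for you, you can:

First strip the IDxxx from each line: perl -pe "s/^ID\d+\s+//" file > cleanfile
Then use BeyondCompare or Windiff to compare the files.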

Answer 2

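I played around with your code a bit. What you wrote there is actually three loops:

one over the lines of the first file,
one over the lines of the second file,
one over all the fields in these lines. You unrolled this loop manually.

The rest of this answer assumes that the files are strictly tab-separated, and that any other whitespace is significant (even at the end of fields and lines).

Here is a condensed version of the code (it assumes open filehandles $file1 and $file2, and use strict):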

my @sample2 = <$file2>;

SAMPLE_1:
foreach my $s1 (<$file1>) {
    my (undef, @fields1) = split /\t/, $s1;
    my @regexens = map qr{\Q$_\E(.*)}, @fields1;

    SAMPLE_2:
    foreach my $s2 (@sample2) {
        my (undef, @fields2) = split /\t/, $s2;
        for my $i (0 .. $#regexens) {
            $fields2[$i] =~ $regexens[$i] or next SAMPLE_2;
        }
        # only gets here if all regexes matched
        print $s1, $s2;
    }
}

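I made some optimizations: precompiling the various regexes and storing them in an array, quoting the contents of the fields, and so on. However, this algorithm is O(n²), which is bad.

Here is an elegant variant of that algorithm, which relies on the fact that only the first field differs - the rest of the line has to be identical, character for character: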

my @sample2 = <$file2>;

foreach my $s1 (<$file1>) {
    foreach my $s2 (@sample2) {
        print $s1, $s2 if (split /\t/, $s1, 2)[1] eq (split /\t/, $s2, 2)[1];
    }
}

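I simply test the rest of each line for string equality. While this algorithm is still O(n²), it outperforms the first solution by roughly an order of magnitude, simply by avoiding braindead regexes.

Finally, here is an O(n) solution. It is a variant of the previous one, but it runs the two loops one after the other instead of nesting them, and therefore finishes in linear time. We use a hash: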

# first loop via map
my %seen = map {reverse(split /\t/, $_, 2)}
           # map {/\S/ ? $_ : () } # uncomment this line to handle empty lines
           <$file1>;

# 2nd loop
foreach my $line (<$file2>) {
    my ($id2, $key) = split /\t/, $line, 2;
    if (defined (my $id1 = $seen{$key})) {
        print "$id1\t$key";
        print "$id2\t$key";
    }
}

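%seen is a hash with the rest of the line as the key and the first field as the value. In the second loop we again retrieve the rest of the line. If that line was present in the first file, we reconstruct both complete lines and print them out. This solution performs better than the others, and thanks to its linear complexity it scales well both up and down.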

Answer 3

How about:

#!/usr/bin/perl

use File::Slurp;
use strict;

my ($ina, $inb) = @ARGV;

my @lines_a = File::Slurp::read_file($ina);
my @lines_b = File::Slurp::read_file($inb);

my $table_b = {};

my $ln = 0;

# Store all lines in second file in a hash with every different value as a hash key
# If there are several identical ones we store them also, so the hash values are lists  containing the id and line number
foreach (@lines_b) {
    chomp; # strip newlines
    $ln++; # count current line number 
    my ($id, $rest) = split(m{[\t\s]+}, $_, 2); # split on whitespaces, could be too many tabs or spaces instead
    if (exists $table_b->{$rest}) {
        push @{ $table_b->{$rest} }, [$id, $ln]; # push to existing list if we already found an entry that is the same
    } else {
        $table_b->{$rest} = [ [$id, $ln] ]; # create new entry if this is the first one
    }
} 

# Go thru first file and print out all matches we might have
$ln = 0;
foreach (@lines_a) {
    chomp;
    $ln++; 
    my ($id, $rest) = split(m{[\t\s]+}, $_, 2);
    if (exists $table_b->{$rest}) { # if we have this entry print where it is found
        print "$ina:$ln:\t\t'$id\t$rest'\n " . (join "\n ", map { "$inb:$_->[1]:\t\t'$_->[0]\t$rest'" } @{ $table_b->{$rest} }) . "\n";
    }
}
