Perl: regex - matching values to the alphabet

Time: 2014-09-15 03:46:35

Tags: regex bash perl

I wrote a small Perl "hack" to replace the 1s in a range of columns of a tab-delimited file with letters. The file looks like this:

Chr Start   End Name    Score   Strand  Donor   Acceptor    Merged_Transcript   Gencode Colon   Heart   Kidney  Liver   Lung    Stomach
chr10   100177483   100177931   .   .   -   1   1   1   1   1   0   1   1   0   0
chr10   100178014   100179801   .   .   -   1   1   1   1   1   1   1   1   1   0
chr10   100179915   100182125   .   .   -   1   1   1   1   1   1   1   0   1   0
chr10   100182270   100183359   .   .   -   1   1   1   1   0   0   1   0   1   0
chr10   100183644   100184069   .   .   -   1   1   1   1   0   0   1   0   1   0

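The goal is to take columns 11 through 16 and, wherever the value 1 appears in one of those columns, append the corresponding letter (A for the first of those columns, B for the second, and so on). So far my code produces empty output, and this is my first time doing regex.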

cat infile.txt \
| perl -ne '@alphabet=("A".."Z");
            $is_known_intron = 0;
            $is_known_donor = 1;
            $is_known_acceptor = 1;
            chomp;
            $_ =~ s/^\s+//;
            @d = split /\s+/, $_;
            @d_bool=@d[$11-$16];
            $ct=1;
            $known_intron = $d[$10];
            $num_of_overlapping_gene = $d[$9];
            $known_acceptor = $d[$8];
            $known_donor = $d[$7];
            $k="";
            if (($known_intron == $is_known_intron) and ($known_donor == $is_known_donor) and ($known_acceptor == $is_known_acceptor)) {
               for ($i = 0; $i < scalar @d_bool; $i++){
                   $k.=$alphabet[$i] if ($d_bool[$i])
                }
                $alphabet_ct{$k}+=$ct;
            }
            END
            {
               foreach $k (sort keys %alphabet_ct){
                   print join("\t", $k, $alphabet_ct{$k}), "\n";
               }
            } '\
   > Outfile.txt

What should I do?

Thanks!

*Edit*

Expected output

ABCD 45
BCD 23
ABCDEF 1215

and so on.

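1 Answer:

Answer 0: (score: 1)

I converted your code into a script to make it easier to debug. I have added comments in the code to point out the dodgy bits: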

use strict;
use warnings;

my %alphabet_ct;
my @alphabet = ( "A" .. "Z" );

my $is_known_intron   = 0;
my $is_known_donor    = 1;
my $is_known_acceptor = 1;

while (<DATA>) {
    # don't process the first line
    next unless /chr10/;
    chomp;
    # this should remove whitespace at the beginning of the line but is doing nothing as there is none
    $_ =~ s/^\s+//;

    my @d = split /\s+/, $_;
    # the range operator in perl is .. (not "-")
    my @d_bool         = @d[ 10 .. 15 ];
    my $known_intron   = $d[9];
    my $known_acceptor = $d[7];
    my $known_donor    = $d[6];
    my $k              = "";
    # this expression is false for all the data in the sample you provided as
    # $is_known_intron is set to 0
    if (    ( $known_intron   == $is_known_intron )
        and ( $known_donor    == $is_known_donor )
        and ( $known_acceptor == $is_known_acceptor ) )
    {
        for ( my $i = 0; $i < scalar @d_bool; $i++ ) {
            $k .= $alphabet[$i] if $d_bool[$i];
        }
        # it is more idiomatic to write $alphabet_ct{$k}++;
        # $alphabet_ct{$k} += $ct;
        $alphabet_ct{$k}++;
    }
}
foreach my $k ( sort keys %alphabet_ct ) {
    print join( "\t", $k, $alphabet_ct{$k} ) . "\n";
}

__DATA__
Chr Start   End Name    Score   Strand  Donor   Acceptor    Merged_Transcript   Gencode Colon   Heart   Kidney  Liver   Lung    Stomach
chr10   100177483   100177931   .   .   -   1   1   1   1   1   0   1   1   0   0
chr10   100178014   100179801   .   .   -   1   1   1   1   1   1   1   1   1   0
chr10   100179915   100182125   .   .   -   1   1   1   1   1   1   1   0   1   0
chr10   100182270   100183359   .   .   -   1   1   1   1   0   0   1   0   1   0
chr10   100183644   100184069   .   .   -   1   1   1   1   0   0   1   0   1   0

With $is_known_intron set to 1, the sample data gives this result:

ABCDE   1
ABCE    1
ACD 1
CE  2
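
As a side note on why the original one-liner printed nothing at all: in Perl, $7, $10, $11 and so on are regex capture variables, not column numbers. No match with that many groups has run, so they are all undef, every index collapses to 0, and $d[$7] is just "chr10", which never equals 1 numerically. The condition therefore can never be true. The sketch below shows this on one row from the question and then does the 1-based to 0-based conversion explicitly. The helper name columns is made up for illustration, it is not part of the original code.

```perl
use strict;
use warnings;

# One data row from the question, split on whitespace as in the original code.
my $line = "chr10\t100177483\t100177931\t.\t.\t-\t1\t1\t1\t1\t1\t0\t1\t1\t0\t0";
my @d    = split /\s+/, $line;

# What the one-liner actually asked for: $11 and $16 are undefined capture
# variables, so "$11-$16" is undef minus undef, i.e. 0.
my ( $orig_index, @orig_slice );
{
    no warnings;    # silence the "uninitialized value" warnings on purpose
    $orig_index = $11 - $16;
    @orig_slice = @d[ $11 - $16 ];    # same as @d[0]
}
print "original index: $orig_index\n";    # original index: 0
print "original slice: @orig_slice\n";    # original slice: chr10

# Hypothetical helper: take 1-based column numbers (as you would count them
# in the file) and return the matching 0-based array slice.
sub columns {
    my ( $row, $first, $last ) = @_;
    return @{$row}[ $first - 1 .. $last - 1 ];
}

my @flags = columns( \@d, 11, 16 );
print "columns 11-16: @flags\n";          # columns 11-16: 1 0 1 1 0 0

# Build the letter key: keep the letter at each position whose flag is true.
my @alphabet = ( 'A' .. 'Z' );
my $key = join '', map { $flags[$_] ? $alphabet[$_] : () } 0 .. $#flags;
print "key: $key\n";                      # key: ACD
```

Running the one-liner with perl -w (or adding use warnings) would have flagged the undefined values straight away, which is usually the quickest way to catch this kind of mistake.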