我写了一个小的perl“hack”,用一个制表符分隔文件中的一系列列中的字母替换1。该文件如下所示:
Chr Start End Name Score Strand Donor Acceptor Merged_Transcript Gencode Colon Heart Kidney Liver Lung Stomach
chr10 100177483 100177931 . . - 1 1 1 1 1 0 1 1 0 0
chr10 100178014 100179801 . . - 1 1 1 1 1 1 1 1 1 0
chr10 100179915 100182125 . . - 1 1 1 1 1 1 1 0 1 0
chr10 100182270 100183359 . . - 1 1 1 1 0 0 1 0 1 0
chr10 100183644 100184069 . . - 1 1 1 1 0 0 1 0 1 0
如果在这些列中看到值1,则gola将采用第11到第16列并将字母A附加到Z.到目前为止,我的代码产生了一个空输出,这是我第一次做正则表达式。
cat infile.txt \
| perl -ne '@alphabet=("A".."Z");
$is_known_intron = 0;
$is_known_donor = 1;
$is_known_acceptor = 1;
chomp;
$_ =~ s/^\s+//;
@d = split /\s+/, $_;
@d_bool=@d[$11-$16];
$ct=1;
$known_intron = $d[$10];
$num_of_overlapping_gene = $d[$9];
$known_acceptor = $d[$8];
$known_donor = $d[$7];
$k="";
if (($known_intron == $is_known_intron) and ($known_donor == $is_known_donor) and ($known_acceptor == $is_known_acceptor)) {
for ($i = 0; $i < scalar @d_bool; $i++){
$k.=$alphabet[$i] if ($d_bool[$i])
}
$alphabet_ct{$k}+=$ct;
}
END
{
foreach $k (sort keys %alphabet_ct){
print join("\t", $k, $alphabet_ct{$k}), "\n";
}
} '\
> Outfile.txt
我应该做什么呢?
谢谢!
*编辑*
预期产出
ABCD 45
BCD 23
ABCDEF 1215
等等。
答案 0 :(得分:1)
我将您的代码转换为脚本以便于调试。我在代码中添加注释以指出狡猾的位:
use strict;
use warnings;
my %alphabet_ct;
my @alphabet = ( "A" .. "Z" );
my $is_known_intron = 0;
my $is_known_donor = 1;
my $is_known_acceptor = 1;
while (<DATA>) {
# don't process the first line
next unless /chr10/;
chomp;
# this should remove whitespace at the beginning of the line but is doing nothing as there is none
$_ =~ s/^\s+//;
my @d = split /\s+/, $_;
# the range operator in perl is .. (not "-")
my @d_bool = @d[ 10 .. 15 ];
my $known_intron = $d[9];
my $known_acceptor = $d[7];
my $known_donor = $d[6];
my $k = "";
# this expression is false for all the data in the sample you provided as
# $is_known_intron is set to 0
if ( ( $known_intron == $is_known_intron )
and ( $known_donor == $is_known_donor )
and ( $known_acceptor == $is_known_acceptor ) )
{
for ( my $i = 0; $i < scalar @d_bool; $i++ ) {
$k .= $alphabet[$i] if $d_bool[$i];
}
# it is more idiomatic to write $alphabet_ct{$k}++;
# $alphabet_ct{$k} += $ct;
$alphabet_ct{$k}++;
}
}
foreach my $k ( sort keys %alphabet_ct ) {
print join( "\t", $k, $alphabet_ct{$k} ) . "\n";
}
__DATA__
Chr Start End Name Score Strand Donor Acceptor Merged_Transcript Gencode Colon Heart Kidney Liver Lung Stomach
chr10 100177483 100177931 . . - 1 1 1 1 1 0 1 1 0 0
chr10 100178014 100179801 . . - 1 1 1 1 1 1 1 1 1 0
chr10 100179915 100182125 . . - 1 1 1 1 1 1 1 0 1 0
chr10 100182270 100183359 . . - 1 1 1 1 0 0 1 0 1 0
chr10 100183644 100184069 . . - 1 1 1 1 0 0 1 0 1 0
将$is_known_intron
设置为1,样本数据会显示结果:
ABCDE 1
ABCE 1
ACD 1
CE 2