下面的Perl脚本是用shell编写的。
如果我使用制表符分隔的文件numeric
,那么我会得到相应解析的每行的所需结果。但是,如果我使用文件alpha
作为输入,则只解析第一行。
alpha
和numeric
之间的唯一区别是numeric
有NC_000023
NC_000023.11:g.41747805_41747806delinsTT
NC_000023.11:g.41750615C>A
alpha
有NC_0000X
NC_0000X.11:g.41747805_41747806delinsTT
NC_0000X.11:g.41750615C>A
我错过了什么?
Input Variant Errors Chromosomal Variant Coding Variant(s)
NM_003924.3:c.*18_*19delGCinsAA NC_000023.11:g.41747805_41747806delinsTT LRG_513t1:c.*18_*19delinsAA NM
NM_003924.3:c.013G>T NC_000023.11:g.41750615C>A LRG_513t1:c.13G>T
Input Variant Errors Chromosomal Variant Coding Variant(s)
NM_003924.3:c.*18_*19delGCinsAA NC_0000X.11:g.41747805_41747806delinsTT LRG_513t1:c.*18_*19delinsAA NM_003924.3:c.*18_*19delinsAA
NM_003924.3:c.013G>T NC_0000X.11:g.41750615C>A LRG_513t1:c.13G>T NM_003924.3:c.13G>T
perl -ne '
next if $. == 1;
if ( /.*del([A-Z]+)ins([A-Z]+).*NC_0+([^.]+)\..*g\.([0-9]+)_([0-9]+)/ ) { # indel
print join( "\t", $3, $4, $5, $1, $2 ), "\n";
}
else {
while ( /\t*NC_(\d+)\.\S+g\.(\d+)(\S+)/g ) {
# conditional parse
( $num1, $num2, $common ) = ( $1, $2, $3 );
$num3 = $num2;
if ( $common =~ /^([A-Z])>([A-Z])$/ ) { # SNP
( $ch1, $ch2 ) = ( $1, $2 );
}
elsif ( $common =~ /^del([A-Z])$/ ) { # deletion
( $ch1, $ch2 ) = ( $1, "-" );
}
elsif ( $common =~ /^ins([A-Z])$/ ) { # insertion
( $ch1, $ch2 ) = ( "-", $1 );
}
elsif ( $common =~ /^_(\d+)del([A-Z]+)$/ ) { # multi deletion
( $num3, $ch1, $ch2 ) = ( $1, $2, "-" );
}
elsif ( $common =~ /^_(\d+)ins([A-Z]+)$/ ) { # multi insertion
( $num3, $ch1, $ch2 ) = ( "-", $1, $2 );
}
printf( "%d\t%d\t%d\t%s\t%s\n", $num1, $num2, $num3, $ch1, $ch2 ); # output
map { undef } ( $num1, $num2, $num3, $common, $ch1, $ch2 );
}
}' numeric
23 41747805 41747806 GC AA
23 41750615 41750615 C A
使用alpha输出
X 41747805 41747806 GC AA
如果我在\w
条件中使用\d
而不是while
,就像这样
while ( /\t*NC_(\w+)\.\S+g\.(\d+)(\S+)/g ) { ... }
我得到了这个结果
X 41747805 41747806 GC AA
0 41750615 41750615 C A
为什么$1
答案 0 :(得分:2)
while (/\t*NC_(\d+)\.
与'NC_0000X.11'不匹配,因为'X'而且正则表达式只查找数字。
完成更改后,NC_(\w+)
将匹配“NC_0000X”,$num1
设置为“0000X”。
您的printf "%d...." $num1 ...
将为非数字输入打印0。当$num1
为'0000X'时,它将打印为0.
输入示例表明,每一行都由字段组成,这些字段由空格分隔。有些领域是有意义的,不是。每个字段都包含可识别的信息。
您的程序应遵循此结构。
在较小的chuncks上工作更容易,而不是找到适用于整个生产线的正则表达式。它更容易阅读,理解和维护。