Question

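In the awk below I am trying to extract and compare each substring in $4 that starts with p.. If the first three letters are the same as the last three letters (there is a number in between), then p. is updated to p.(3 letters)(digit)(=), where the parentheses only mark the three groups and are not part of the output. If the 3 letters are different, the line is left unchanged. Line 1 of the file below is an example. My actual data has about 10,000 lines and about 50 columns, but $4 is the only column that holds these values, and the format after p. is always three letters followed by a 1-4 digit number and then three more letters. I think the awk attempt below will extract each p. and split on the ;, but I am not sure how to compare the two groups to check whether the three letters are the same. Thank you :).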

file (tab-delimited)

Chr Start   ExonicFunc.refGene  AAChange.refGene
chr1    155880573   synonymous SNV  RIT1:NM_001256821:exon2:c.31G>C:p.Glu110Glu;RIT1:NM_001256822:exon2:c.31G>C:p.Glu110Glu
chr1    155880573   nonsynonymous SNV   RIT1:NM_001256821:exon2:c.31G>C:p.Glu11Gln

desired output (tab-delimited)

Chr Start   ExonicFunc.refGene  AAChange.refGene
chr1    155880573   synonymous SNV  RIT1:NM_001256821:exon2:c.31G>C:p.Glu110=;RIT1:NM_001256822:exon2:c.31G>C:p.Glu110=
chr1    155880573   nonsynonymous SNV   RIT1:NM_001256821:exon2:c.31G>C:p.Glu11Gln

AWK

awk '
BEGIN { OFS="\t" }
$4 ~ /:NM/ {
ostring=""
# split $4 by ";" and cycle through them
nNM=split($4,NM,";")
for (n=1; n<=nNM; n++) {
  if (n>1) ostring=(ostring ";") # append ";"
   if (match(NM[n],/p[.].*/)) {
     # copy up to "p."
     ostring=(ostring substr(NM[n],1,RSTART+1))
     # Get the substring after "p."
     VAL=substr(NM[n],RSTART+2)
     # Get its length
     lenVAL=length(VAL)
     # store aa array
     aa=[{while(length($4)=3){print substr($044,1,3);gsub(/^./,"")}]}' file

Answer 1

Extended GNU awk solution:

awk 'NR==1; NR > 1{ 
         len = split($4, a, /\<p\.[a-zA-Z]{3}[0-9]+[a-zA-Z]{3}\>/, seps);
         if (len == 1){ print; next }
         res = "" 
         for (i=1; i < len; i++) {
             s = seps[i]; 
             if (substr(s, 3, 3) == substr(s, length(s) - 2)) {
                 seps[i] = substr(s, 1, length(s) - 3)"="; 
             }
         } 
         for (i=1; i <= len; i++) 
             res = res a[i] (seps[i]? seps[i]:""); 
         $4 = res; print 
     }' FS='\t' OFS='\t' file

The output:

Chr Start   ExonicFunc.refGene  AAChange.refGene
chr1    155880573   synonymous SNV  RIT1:NM_001256821:exon2:c.31G>C:p.Glu110=;RIT1:NM_001256822:exon2:c.31G>C:p.Glu110=
chr1    155880573   nonsynonymous SNV   RIT1:NM_001256821:exon2:c.31G>C:p.Glu11Gln
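
For readers unfamiliar with the four-argument split() used in the solution above: in GNU awk the fourth argument receives the separator strings that matched the regex, so a[] holds the text between the p. values and seps[] holds the p. values themselves. The sketch below is only an illustration. It assumes GNU awk is installed as gawk, the helper name show_split is hypothetical, and the sample string is made up from the question's data.

```shell
# Hypothetical helper; assumes GNU awk is available as gawk.
show_split() {
    gawk '{
        # 4th argument (a gawk extension) collects the matched separators
        n = split($0, a, /p\.[a-zA-Z]{3}[0-9]+[a-zA-Z]{3}/, seps)
        # a[i] is the text before the i-th match, seps[i] the match itself
        for (i = 1; i < n; i++)
            printf "a[%d]=%s seps[%d]=%s\n", i, a[i], i, seps[i]
        # the last piece has no separator after it
        printf "a[%d]=%s\n", n, a[n]
    }'
}

echo 'x:p.Glu110Glu;y:p.Glu11Gln;z' | show_split
```

With that input the string splits into three pieces (x:, ;y: and ;z) around two separators (p.Glu110Glu and p.Glu11Gln), which is why the solution loops over seps[] to rewrite the matches and then stitches a[] and seps[] back together.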

Time performance measurement:

The input testfile:

$ wc -l testfile
10000 testfile

time(awk 'NR==1; NR > 1{ 
         len = split($4, a, /\<p\.[a-zA-Z]{3}[0-9]+[a-zA-Z]{3}\>/, seps);
         if (len == 1){ print; next }
         res = "" 
         for (i=1; i < len; i++) {
             s = seps[i]; 
             if (substr(s, 3, 3) == substr(s, length(s) - 2)) {
                 seps[i] = substr(s, 1, length(s) - 3)"="; 
             }
         } 
         for (i=1; i <= len; i++) 
             res = res a[i] (seps[i]? seps[i]:""); 
         $4 = res; print 
     }' FS='\t' OFS='\t' testfile >/dev/null)

real    0m0.269s
user    0m0.256s
sys 0m0.000s

time(awk 'BEGIN { FS=OFS="\t" }
NR>1 {
    head = ""
    tail = $4
    while ( match(tail,/(p\.([[:alpha:]]{3})[0-9]+)([[:alpha:]]{3})/,a) ) {
        head = head substr(tail,1,RSTART-1) a[1] (a[2] == a[3] ? "=" : a[3])
        tail = substr(tail,RSTART+RLENGTH)
    }
    $4 = head tail
}
{ print }' testfile >/dev/null)

real    0m0.470s
user    0m0.416s
sys 0m0.008s

Answer 2

With GNU awk for the 3rd arg to match():

awk 'BEGIN { FS=OFS="\t" }
NR>1 {
    head = ""
    tail = $4
    while ( match(tail,/(p\.([[:alpha:]]{3})[0-9]+)([[:alpha:]]{3})/,a) ) {
        head = head substr(tail,1,RSTART-1) a[1] (a[2] == a[3] ? "=" : a[3])
        tail = substr(tail,RSTART+RLENGTH)
    }
    $4 = head tail
}
{ print }' file
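
Both approaches in this thread rely on GNU awk extensions (four-argument split() and three-argument match()). Where only a POSIX awk is available, the same comparison can probably be done with plain match(), RSTART, RLENGTH and substr(). The following is a minimal sketch under that assumption, not a tested replacement for either answer. The function name collapse_synonymous is hypothetical, and the regex spells out the three letters instead of using {3} because some older awks do not support interval expressions.

```shell
# Hypothetical helper name; reads tab-delimited text on stdin.
collapse_synonymous() {
    awk 'BEGIN { FS = OFS = "\t" }
    NR > 1 {
        head = ""; tail = $4
        # p. + three letters + digits + three letters, written without {3}
        while (match(tail, /p\.[A-Za-z][A-Za-z][A-Za-z][0-9]+[A-Za-z][A-Za-z][A-Za-z]/)) {
            m    = substr(tail, RSTART, RLENGTH)   # whole match, e.g. p.Glu110Glu
            ref  = substr(m, 3, 3)                 # first three letters
            alt  = substr(m, RLENGTH - 2)          # last three letters
            keep = substr(m, 1, RLENGTH - 3)       # p. + letters + digits
            head = head substr(tail, 1, RSTART - 1) keep (ref == alt ? "=" : alt)
            tail = substr(tail, RSTART + RLENGTH)
        }
        $4 = head tail
    }
    { print }'
}

# Illustrative input: a header line plus one data line with two p. values.
printf 'h1\th2\th3\th4\nchr1\t1\tsyn\tA:p.Glu110Glu;B:p.Glu11Gln\n' | collapse_synonymous
```

On the data line, the fourth field becomes A:p.Glu110=;B:p.Glu11Gln: the first p. value has matching letters and is collapsed to =, the second is left as it was. On the real data it would be called as collapse_synonymous < file.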

awk to compare value of substring in field

2 answers: