awk ... regular expression ... exceeds implementation size limit

Time: 2018-01-29 00:53:09

Tags: awk

Does anyone happen to have any insight or advice on this error, i.e. whether it can be "fixed" and, if so, how best to do it?

  

awk: line 1: regular expression /splice_acc ... exceeds implementation size limit

The expression used in my bash script is:

  

grep -v '^##' $IN | awk 'BEGIN{FS=" "; OFS=" "} $1~/#CHROM/ || $10~/^1\/1/ && ($11~/^1\/0/ || $11~/^0\/0/ || $11~/^0\/1/) && $1~/^[0-9X]*$/ && /splice_acceptor_variant|splice_donor_variant|splice_region_variant|stop_lost|start_lost|stop_gained|missense_variant|coding_sequence_variant|inframe_insertion|disruptive_inframe_insertion|inframe_deletion|disruptive_inframe_deletion|exon_variant|exon_loss_variant|exon_loss_variant|duplication|inversion|frameshift_variant|feature_ablation|duplication|gene_fusion|bidirectional_gene_fusion|rearranged_at_DNA_level|miRNA|initiator_codon_variant|start_retained/ {$3=$7=""; print $0}' | sed 's/ */ /g' | awk '{split($9,a,":"); split(a[2],b,","); if (b[1]>b[2] || $1~/#CHROM/) print $0}' > $OUT

Any help you can give is much appreciated.

Thanks for your suggestions!

A sample of the input:

Chr1 926694 . C T 2510.49 . AB=0;ABP=0;AC=2;AF=1;AN=2;AO=82;CIGAR=1X;DP=85;DPB=85;DPRA=0;EPP=6.82362;EPPR=9.52472;GTI=0;LEN=1;MEANALT=1;MQM=57.0854;MQMR=60;NS=1;NUMALT=1;ODDS=108.152;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=2916;QR=42;RO=3;RPL=46;RPP=5.65844;RPPR=9.52472;RPR=36;RUN=1;SAF=45;SAP=4.70511;SAR=37;SRF=0;SRP=9.52472;SRR=3;TYPE=snp;ANN=T|upstream_gene_variant|MODIFIER|AT1G03720|AT1G03720|transcript|AT1G03720.1|protein_coding||c.-321G>A|||||321|,T|downstream_gene_variant|MODIFIER|AT1G03700|AT1G03700|transcript|AT1G03700.1|protein_coding||c.*4850C>T|||||4793|,T|downstream_gene_variant|MODIFIER|AT1G03710|AT1G03710|transcript|AT1G03710.1|protein_coding||c.*2407C>T|||||1968|,T|downstream_gene_variant|MODIFIER|AT1G03730|AT1G03730|transcript|AT1G03730.1|protein_coding||c.*4323G>A|||||4134|,T|downstream_gene_variant|MODIFIER|AT1G03710|AT1G03710|transcript|AT1G03710.2|protein_coding||c.*2407C>T|||||2339|,T|intergenic_region|MODIFIER|AT1G03720-AT1G03730|AT1G03720-AT1G03730|intergenic_region|AT1G03720-AT1G03730|||n.926694C>T|||||| GT:DP:AD:RO:QR:AO:QA:GL 1/1:85:3,82:3:42:82:2916:-252.316,-21.6676,0

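1 answer:

Answer 0 (score: 1)

Rather than trying to fit everything into one expression, I broke it into smaller pieces, since I was struggling to get my head around the whole thing.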

BEGIN {
    FS=" "; 
    OFS=" "
    # this is your big list of words that was making awk choke.
    # this list is available to the function test_words.
    split("splice_acceptor_variant splice_donor_variant splice_region_variant"\
          " stop_lost start_lost stop_gained missense_variant coding_sequence_variant"\
          " inframe_insertion disruptive_inframe_insertion inframe_deletion"\
          " disruptive_inframe_deletion exon_variant exon_loss_variant exon_loss_variant"\
          " duplication inversion frameshift_variant feature_ablation duplication"\
          " gene_fusion bidirectional_gene_fusion rearranged_at_DNA_level"\
          " miRNA initiator_codon_variant start_retained", test_word_arr)
} 

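function test_words(hs,    i) {
    # if any words from test_word_arr are in the string passed 
    # to this function, return true.
    # `i` is declared as an extra parameter so it stays local.
    for (i in test_word_arr) {
        # the words are plain text, so a substring search with
        # index() is enough and no regex is needed
        if (index(hs, test_word_arr[i])) return 1;
    }
    return 0;
}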

# apply the initial `grep -v '^##'` command
/^##/ { next }

# it appears to me that any string that starts '#CHROM' should 
# be printed with minimal editing - it has automatically passed
# the test for the second `awk` script 
$1 ~ /#CHROM/ {
    $3 = "";
    $7 = "";
    gsub(/  */, " ")
    print $0
    # the header is done; skip the remaining rules
    next
}

# these were all the conditions that were expected to be true to
# perform the final processing. So they can be checked off one 
# by one, and if any are *not* true, the line can be skipped.
$10 !~ /^1\/1/ { next }
# $11 must start with 1/0, 0/0 or 0/1
$11 !~ /^(1\/0|0\/[01])/ { next }
$1  !~ /^[X[:digit:]]*$/ { next }
# this is performing the test that couldn't be done previously
test_words($0) == 0 { next }

{
    # finally, any line still being assessed has 'passed' so 
    # perform the processing from your first awk script.
    $3 = "";
    $7="";

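    # this is basically the following `sed` script.
    # gsub() on $0 re-splits the record, so the two emptied fields
    # disappear and $9 below is what was $11 before.
    gsub(/  */, " ")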

    # and this is the final awk script
    split($9, a, ":"); split(a[2], b, ",");
    if (b[1] > b[2])
        print $0
}

Since no sample input/output was available, this is untested, so it may need checking and editing to fix any problems.
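To make the two main ideas easier to try in isolation, here is a minimal sketch on made-up data: a word list checked with `index()` in place of one very long alternation, and a single grouped regex for testing one field against several alternatives. The function names `filter_by_words` and `has_word`, the three-word list, and the four input lines are all hypothetical and only serve the illustration; they are not taken from the real VCF.

```shell
# Toy filter: print the first column of lines whose second column starts
# with 1/0, 0/0 or 0/1 AND which contain one of the listed words.
filter_by_words() {
    awk '
    BEGIN {
        # hypothetical short word list; split() with no third argument
        # uses the default whitespace splitting
        n = split("stop_lost start_lost missense_variant", words)
    }
    # return 1 if any listed word occurs in s as a plain substring
    function has_word(s,    i) {
        for (i = 1; i <= n; i++)
            if (index(s, words[i])) return 1
        return 0
    }
    # one regex with grouped alternatives tests a single field
    $2 ~ /^(1\/0|0\/[01])/ && has_word($0) { print $1 }
    '
}

# made-up input lines, not real VCF records
result=$(printf '%s\n' \
    'v1 0/1 ANN=T|missense_variant|MODERATE' \
    'v2 1/1 ANN=T|missense_variant|MODERATE' \
    'v3 0/0 ANN=T|upstream_gene_variant|MODIFIER' \
    'v4 1/0 ANN=T|stop_lost|HIGH' | filter_by_words)
echo "$result"
```

With this toy input the function prints `v1` and `v4`: `v2` fails the genotype test and `v3` contains none of the listed words. The full script above can be used the same way once saved to a file (say `filter.awk`, a name chosen here only for illustration) and run as `awk -f filter.awk "$IN" > "$OUT"`.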