awk: retrieving specific fields by name according to the file structure

Date: 2017-11-15 13:47:41

Tags: awk

MyFile:

    1       Cufflinks       exon    162752  163607  .       +       .       gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; oId "CUFF.1.1"; class_code "u"; tss_id "TSS1";
    1       Cufflinks       exon    177199  177399  .       +       .       gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "1"; oId "CUFF.3.1"; class_code "u"; tss_id "TSS2";
    1       Cufflinks       exon    178775  179390  .       +       .       gene_id "XLOC_000003"; transcript_id "TCONS_00000003"; exon_number "1"; gene_name "ENSORLG00000000007"; oId "CUFF.15.1"; nearest_ref "ENSORLT00000000006"; class_code "s"; tss_id "TSS3";
    1       Cufflinks       exon    218671  219224  .       +       .       gene_id "XLOC_000007"; transcript_id "TCONS_00000005"; exon_number "1"; gene_name "slc43a1b"; oId "CUFF.50.1"; nearest_ref "ENSORLT00000000013"; class_code "s"; tss_id "TSS7";

Desired output:

    1       Cufflinks       exon    162752  163607  .       +       .       gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; oId "CUFF.1.1"; class_code "u"; tss_id "TSS1";
    1       Cufflinks       exon    177199  177399  .       +       .       gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "1"; oId "CUFF.3.1"; class_code "u"; tss_id "TSS2";
    1       Cufflinks       exon    178775  179390  .       +       .       gene_id "XLOC_000003"; transcript_id "ENSORLT00000000006"; exon_number "1"; gene_name "ENSORLG00000000007"; oId "CUFF.15.1"; nearest_ref "ENSORLT00000000006"; class_code "s"; tss_id "TSS3";
    1       Cufflinks       exon    218671  219224  .       +       .       gene_id "XLOC_000007"; transcript_id "ENSORLT00000000013"; exon_number "1"; gene_name "slc43a1b"; oId "CUFF.50.1"; nearest_ref "ENSORLT00000000013"; class_code "s"; tss_id "TSS7";

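Explanation:

If a row has a nearest_ref field, write its value into the transcript_id field; otherwise leave the row unchanged.

The nearest_ref field:

nearest_ref "XXXXXXX";

The transcript_id field:

transcript_id "XXXXXXX";

If possible, I'd like an awk solution.

I don't know how to retrieve fields by their name rather than by their position: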

awk -v FS=" " 'length($20)>20{$12=$20} 1' MyFile |less

Note: the file is tab-delimited, and column 9 is space-delimited.

Edit: I found a way, but it's really ugly, and I'm still interested in a better one:

awk -v FS=" " -v OFS="\t" 'length($20)>20{$12=$20} 1' MyFile | sed "s/;\t/; /g" | sed 's/\t"/ "/g'

2 answers:

Answer 0: (score: 2)

This is simple enough for sed:

$ sed -E 's/(transcript_id )[^;]+(.*nearest_ref )([^;]+);/\1\3\2\3;/' file

Output:

    1       Cufflinks       exon    162752  163607  .       +       .       gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; oId "CUFF.1.1"; class_code "u"; tss_id "TSS1";
    1       Cufflinks       exon    177199  177399  .       +       .       gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "1"; oId "CUFF.3.1"; class_code "u"; tss_id "TSS2";
    1       Cufflinks       exon    178775  179390  .       +       .       gene_id "XLOC_000003"; transcript_id "ENSORLT00000000006"; exon_number "1"; gene_name "ENSORLG00000000007"; oId "CUFF.15.1"; nearest_ref "ENSORLT00000000006"; class_code "s"; tss_id "TSS3";
    1       Cufflinks       exon    218671  219224  .       +       .       gene_id "XLOC_000007"; transcript_id "ENSORLT00000000013"; exon_number "1"; gene_name "slc43a1b"; oId "CUFF.50.1"; nearest_ref "ENSORLT00000000013"; class_code "s"; tss_id "TSS7";
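
To make the capture groups in the sed command above easier to follow, here is a minimal sketch that runs the same substitution on two shortened, made-up attribute strings. The function name copy_nearest_ref and the sample values are assumptions for illustration only. Note that the pattern relies on transcript_id appearing before nearest_ref on the line; a line without nearest_ref does not match and passes through unchanged.

```shell
# Hypothetical wrapper around the sed command from this answer.
copy_nearest_ref() {
    # \1 = 'transcript_id '
    # \2 = everything from after the old transcript_id value up to and including 'nearest_ref '
    # \3 = the quoted nearest_ref value
    # The old transcript_id value ([^;]+ after \1) is dropped and replaced by \3.
    sed -E 's/(transcript_id )[^;]+(.*nearest_ref )([^;]+);/\1\3\2\3;/'
}

# Made-up sample lines (not taken from the real file): one without and one with nearest_ref.
printf '%s\n' \
    'gene_id "G1"; transcript_id "T1"; oId "C1";' \
    'gene_id "G2"; transcript_id "T2"; nearest_ref "R2"; tss_id "S2";' |
    copy_nearest_ref
```

The first sample line should come out untouched, while the second should have "T2" replaced by "R2".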

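Answer 1: (score: 2)

Whenever you have name->value pairs, I find it clearest and easiest to modify later if you first create an array that holds the mapping (n2v[] below) and then change values using only the names as array indices: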

$ cat tst.awk
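BEGIN {
    FS=OFS="\t"
    src = "nearest_ref"      # field whose value is copied
    dst = "transcript_id"    # field whose value is overwritten
}
{
    # Split column 9 on whitespace: odd elements are names, even elements are values.
    n = split($9,f," ")
    delete n2v
    for (i=1; i<=n; i+=2) {
        name  = f[i]
        value = f[i+1]
        n2v[name] = value
    }

    # Rebuild column 9 in its original order, using src's value for dst when src is present.
    new = ""
    for (i=1; i<=n; i+=2) {
        name  = f[i]
        value = ((name == dst) && (src in n2v) ? n2v[src] : n2v[name])
        new = (i>1 ? new " " : "") name " " value
    }
    $9 = new

    print
}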
$
$ awk -f tst.awk file
1       Cufflinks       exon    162752  163607  .       +       .       gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; oId "CUFF.1.1"; class_code "u"; tss_id "TSS1";
1       Cufflinks       exon    177199  177399  .       +       .       gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "1"; oId "CUFF.3.1"; class_code "u"; tss_id "TSS2";
1       Cufflinks       exon    178775  179390  .       +       .       gene_id "XLOC_000003"; transcript_id "ENSORLT00000000006"; exon_number "1"; gene_name "ENSORLG00000000007"; oId "CUFF.15.1"; nearest_ref "ENSORLT00000000006"; class_code "s"; tss_id "TSS3";
1       Cufflinks       exon    218671  219224  .       +       .       gene_id "XLOC_000007"; transcript_id "ENSORLT00000000013"; exon_number "1"; gene_name "slc43a1b"; oId "CUFF.50.1"; nearest_ref "ENSORLT00000000013"; class_code "s"; tss_id "TSS7";

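With this approach you can swap any other fields, change the order in which they are output, or do anything else you like, simply by accessing the fields by name.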
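
As one illustration of accessing attributes by name, the sketch below reuses the same name->value map to print only gene_id and transcript_id for each row, with the quotes and trailing semicolon stripped. The function name print_ids and the sample row are assumptions made for this example; it also clears the array with split("", n2v) instead of delete n2v, which is just one portable alternative.

```shell
# Hypothetical helper: print gene_id and transcript_id, looked up by name, for each row.
print_ids() {
    awk '
    BEGIN { FS = OFS = "\t" }
    {
        split("", n2v)                  # clear the map left over from the previous row
        n = split($9, f, " ")           # odd elements are names, even elements are values
        for (i = 1; i < n; i += 2) {
            value = f[i+1]
            gsub(/^"|";?$/, "", value)  # strip the surrounding quotes and trailing semicolon
            n2v[f[i]] = value
        }
        print n2v["gene_id"], n2v["transcript_id"]
    }'
}

# Made-up, shortened sample row (tab-separated columns 1-8, space-separated column 9).
printf '1\tCufflinks\texon\t10\t20\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; oId "C1";\n' |
    print_ids
```

Because the lookup is by name, the same function keeps working when rows carry extra attributes such as gene_name or nearest_ref ahead of the ones being printed.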