MyFile:
1 Cufflinks exon 162752 163607 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; oId "CUFF.1.1"; class_code "u"; tss_id "TSS1";
1 Cufflinks exon 177199 177399 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "1"; oId "CUFF.3.1"; class_code "u"; tss_id "TSS2";
1 Cufflinks exon 178775 179390 . + . gene_id "XLOC_000003"; transcript_id "TCONS_00000003"; exon_number "1"; gene_name "ENSORLG00000000007"; oId "CUFF.15.1"; nearest_ref "ENSORLT00000000006"; class_code "s"; tss_id "TSS3";
1 Cufflinks exon 218671 219224 . + . gene_id "XLOC_000007"; transcript_id "TCONS_00000005"; exon_number "1"; gene_name "slc43a1b"; oId "CUFF.50.1"; nearest_ref "ENSORLT00000000013"; class_code "s"; tss_id "TSS7";
Desired output:
1 Cufflinks exon 162752 163607 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; oId "CUFF.1.1"; class_code "u"; tss_id "TSS1";
1 Cufflinks exon 177199 177399 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "1"; oId "CUFF.3.1"; class_code "u"; tss_id "TSS2";
1 Cufflinks exon 178775 179390 . + . gene_id "XLOC_000003"; transcript_id "ENSORLT00000000006"; exon_number "1"; gene_name "ENSORLG00000000007"; oId "CUFF.15.1"; nearest_ref "ENSORLT00000000006"; class_code "s"; tss_id "TSS3";
1 Cufflinks exon 218671 219224 . + . gene_id "XLOC_000007"; transcript_id "ENSORLT00000000013"; exon_number "1"; gene_name "slc43a1b"; oId "CUFF.50.1"; nearest_ref "ENSORLT00000000013"; class_code "s"; tss_id "TSS7";
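Explanation:
If the field nearest_ref is present, write its value into the field transcript_id; otherwise do nothing.
The field nearest_ref looks like this:
nearest_ref "XXXXXXX";
The field transcript_id looks like this:
transcript_id "XXXXXXX";
If possible, I would like an awk solution.
I don't know how to retrieve fields by name rather than by position: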
awk -v FS=" " 'length($20)>20{$12=$20} 1' MyFile |less
Note: the file is tab-separated, and column 9 is space-separated.
Edit: I found a way, but it is really ugly, and I am still interested in a better one:
awk -v FS=" " -v OFS="\t" 'length($20)>20{$12=$20} 1' MyFile | sed "s/;\t/; /g" | sed 's/\t"/ "/g'
Answer 0 (score: 2)
With sed:
$ sed -E 's/(transcript_id )[^;]+(.*nearest_ref )([^;]+);/\1\3\2\3;/' file
Output:
1 Cufflinks exon 162752 163607 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; oId "CUFF.1.1"; class_code "u"; tss_id "TSS1";
1 Cufflinks exon 177199 177399 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "1"; oId "CUFF.3.1"; class_code "u"; tss_id "TSS2";
1 Cufflinks exon 178775 179390 . + . gene_id "XLOC_000003"; transcript_id "ENSORLT00000000006"; exon_number "1"; gene_name "ENSORLG00000000007"; oId "CUFF.15.1"; nearest_ref "ENSORLT00000000006"; class_code "s"; tss_id "TSS3";
1 Cufflinks exon 218671 219224 . + . gene_id "XLOC_000007"; transcript_id "ENSORLT00000000013"; exon_number "1"; gene_name "slc43a1b"; oId "CUFF.50.1"; nearest_ref "ENSORLT00000000013"; class_code "s"; tss_id "TSS7";
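To make the substitution easier to follow, here is a small sketch that runs the same sed command on a single record and annotates the three capture groups. The variable names line and result are only for illustration, and the record is a shortened version of the third line of the question. Note that the pattern expects transcript_id to appear before nearest_ref on the line, as it does in the sample data; a line without nearest_ref does not match and is printed unchanged.

```shell
# One shortened sample record (tabs between the first nine columns).
line=$(printf '1\tCufflinks\texon\t178775\t179390\t.\t+\t.\tgene_id "XLOC_000003"; transcript_id "TCONS_00000003"; exon_number "1"; nearest_ref "ENSORLT00000000006"; class_code "s";')

# \1    = 'transcript_id '
# [^;]+ = the old transcript_id value (not captured, so it is dropped)
# \2    = everything from the following ';' up to and including 'nearest_ref '
# \3    = the nearest_ref value, written back in both places
result=$(printf '%s\n' "$line" |
    sed -E 's/(transcript_id )[^;]+(.*nearest_ref )([^;]+);/\1\3\2\3;/')

printf '%s\n' "$result"
```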
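Answer 1 (score: 2)
Whenever you have name->value pairs, I find it clearest and easiest to modify if you first create an array to hold that mapping (n2v[] below) and then access and modify the values using the names as array indices: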
$ cat tst.awk
BEGIN {
    FS=OFS="\t"
    src = "nearest_ref"
    dst = "transcript_id"
}
{
    # split column 9 into alternating name / value tokens
    n = split($9,f," ")
    delete n2v
    for (i=1; i<=n; i+=2) {
        name = f[i]
        value = f[i+1]
        n2v[name] = value
    }
    # rebuild column 9, substituting the src value for dst when src exists
    new = ""
    for (i=1; i<=n; i+=2) {
        name = f[i]
        value = ((name == dst) && (src in n2v) ? n2v[src] : n2v[name])
        new = (i>1 ? new " " : "") name " " value
    }
    $9 = new
    print
}
$
$ awk -f tst.awk file
1 Cufflinks exon 162752 163607 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; oId "CUFF.1.1"; class_code "u"; tss_id "TSS1";
1 Cufflinks exon 177199 177399 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "1"; oId "CUFF.3.1"; class_code "u"; tss_id "TSS2";
1 Cufflinks exon 178775 179390 . + . gene_id "XLOC_000003"; transcript_id "ENSORLT00000000006"; exon_number "1"; gene_name "ENSORLG00000000007"; oId "CUFF.15.1"; nearest_ref "ENSORLT00000000006"; class_code "s"; tss_id "TSS3";
1 Cufflinks exon 218671 219224 . + . gene_id "XLOC_000007"; transcript_id "ENSORLT00000000013"; exon_number "1"; gene_name "slc43a1b"; oId "CUFF.50.1"; nearest_ref "ENSORLT00000000013"; class_code "s"; tss_id "TSS7";
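With this approach you can swap any other fields you like, change the order in which they are output, or do anything else you want, simply by accessing them by name.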
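As a rough sketch of that flexibility, the same logic can be wrapped in a shell function that takes the source and destination names as arguments. The function name copy_attr and the shortened sample record are made up for illustration; the awk body follows the script above, with split("", n2v) used in place of delete n2v as an assumption for older awks that cannot delete a whole array.

```shell
# Hypothetical helper: copy the value of attribute $1 into attribute $2
# in column 9 of tab-separated input read from stdin.
copy_attr() {
    awk -v src="$1" -v dst="$2" '
    BEGIN { FS = OFS = "\t" }
    {
        n = split($9, f, " ")
        split("", n2v)                       # empty the array portably
        for (i = 1; i <= n; i += 2)
            n2v[f[i]] = f[i+1]               # name -> value (value keeps its ";")
        new = ""
        for (i = 1; i <= n; i += 2) {
            value = (f[i] == dst && (src in n2v)) ? n2v[src] : n2v[f[i]]
            new = (i > 1 ? new " " : "") f[i] " " value
        }
        $9 = new
        print
    }'
}

# Shortened sample record, not taken verbatim from the question.
sample=$(printf '1\tCufflinks\texon\t218671\t219224\t.\t+\t.\tgene_id "XLOC_000007"; transcript_id "TCONS_00000005"; gene_name "slc43a1b";')

# Copy gene_name into gene_id instead of nearest_ref into transcript_id.
out=$(printf '%s\n' "$sample" | copy_attr gene_name gene_id)
printf '%s\n' "$out"
```

Calling copy_attr nearest_ref transcript_id reproduces the behaviour asked for in the question; when the source attribute is missing from a line, that line is printed unchanged.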