Add "#" in front of the 8 lines following a matched STRING

Time: 2015-04-28 17:55:47

Tags: awk sed text-parsing

This question is a bit confusing to put into words, so I'll just show an example.

Suppose I have the following:

$ grep -P "locus_tag\tM715_1000193188" Genome.tbl -B1 -A8
193188  193066  gene
            locus_tag   M715_1000193188
193188  193066  mRNA
            product hypothetical protein
            protein_id  gnl|CorradiLab|M715_1000193188
            transcript_id   gnl|CorradiLab|M715_mrna1000193188
193188  193066  CDS
        product hypothetical protein
        protein_id  gnl|CorradiLab|M715_1000193188
        transcript_id   gnl|CorradiLab|M715_mrna1000193188

I want to add "#" to the 8 lines after "locus_tag M715_1000193188", so that my modified file looks like this:

193188  193066  gene
            locus_tag   M715_1000193188
#193188 193066  mRNA
#           product hypothetical protein
#           protein_id  gnl|CorradiLab|M715_1000193188
#           transcript_id   gnl|CorradiLab|M715_mrna1000193188
#193188 193066  CDS
#       product hypothetical protein
#       protein_id  gnl|CorradiLab|M715_1000193188
#       transcript_id   gnl|CorradiLab|M715_mrna1000193188

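Basically, I have a file containing ~3000 different locus tags, and for 300 of them I need to comment out the mRNA and CDS features, i.e. the 8 lines that follow the locus_tag line.

Is there a workable way to do this with sed? The file also contains other kinds of information that must remain unchanged.

Thanks, Adrian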

4 Answers:

Answer 0: (score: 3)

If you can use awk, this should do it:

awk 'f&&f-- {$0="#"$0} /locus_tag/ {f=8} 1' file
193188  193066  gene
            locus_tag   M715_1000193188
#193188  193066  mRNA
#            product hypothetical protein
#            protein_id  gnl|CorradiLab|M715_1000193188
#            transcript_id   gnl|CorradiLab|M715_mrna1000193188
#193188  193066  CDS
#        product hypothetical protein
#        protein_id  gnl|CorradiLab|M715_1000193188
#        transcript_id   gnl|CorradiLab|M715_mrna1000193188
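
A minimal sketch of the countdown idiom on a toy input. The numbers 1 to 7 stand in for the table and a 3-line countdown stands in for the 8-line one; both are assumptions for illustration only. `f && f--` is true only while `f` is positive and decrements it on each hit, so the rule fires for exactly that many lines after the marker:

```shell
# Toy input: the numbers 1..7; the line "2" plays the role of the locus_tag line.
# Rule 1: while the countdown f is positive, decrement it and comment the line.
# Rule 2: on the marker line, arm a 3-line countdown (8 in the real case).
# Rule 3: the bare pattern 1 prints every line.
out=$(seq 7 | awk 'f && f-- { $0 = "#" $0 }
                   /^2$/    { f = 3 }
                   1')
printf '%s\n' "$out"
```

As written in the answer, `/locus_tag/` arms the countdown on every locus_tag line. Matching the tag value as well, for example `/locus_tag[[:space:]]+M715_1000193188$/` (assuming no trailing whitespace on the line), would restrict it to a single gene.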

Answer 1: (score: 1)

sed supports range addresses, which can do what you want here.

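# The range starts on the locus_tag line itself, so exclude that line from the substitution
sed -e '/locus_tag\tM715_1000193188/,+8{/locus_tag\tM715_1000193188/!s/^/#/}' file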

As noted in the comments, this `addr1,+N` range address form is GNU-specific.
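
A small sketch of how an `addr1,+N` range behaves, assuming GNU sed and a toy input of the numbers 1 to 6 (not the real table). The range covers the matching line plus the next N lines, so a command meant only for the following lines has to skip the start line explicitly:

```shell
# Plain range: the matching line "2" belongs to the range and is commented too.
plain=$(seq 6 | sed -e '/^2$/,+2s/^/#/')

# Guarded range: lines matching the start pattern are excluded with "!".
guarded=$(seq 6 | sed -e '/^2$/,+2{/^2$/!s/^/#/}')

printf '%s\n' "$plain" "---" "$guarded"
```

In the first run lines 2, 3 and 4 are commented; in the second only lines 3 and 4 are.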

Answer 2: (score: 0)

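$ cat tst.awk
# Build a lookup set from the space-separated list passed in with -v tags=...
BEGIN { split(tags,tmp); for (i in tmp) tagsA[tmp[i]] }
# While the countdown is active, comment out the current line
c&&c-- { $0 = "#" $0 }
# A locus_tag line for a listed tag arms an 8-line countdown (NF guard avoids $(-1) on blank lines)
(NF > 1) && ($(NF-1) == "locus_tag") && ($NF in tagsA) { c=8 }
{ print }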

$ awk -v tags="M715_1000193188 M715_1000193189 M715_1000193190" -f tst.awk file
193188  193066  gene
            locus_tag   M715_1000193188
#193188  193066  mRNA
#            product hypothetical protein
#            protein_id  gnl|CorradiLab|M715_1000193188
#            transcript_id   gnl|CorradiLab|M715_mrna1000193188
#193188  193066  CDS
#        product hypothetical protein
#        protein_id  gnl|CorradiLab|M715_1000193188
#        transcript_id   gnl|CorradiLab|M715_mrna1000193188

Just list all 300 of the locus tag values you care about, in the same way as the 3 example values shown above.
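
With 300 tags, a command-line string gets unwieldy. A possible variation, sketched here under assumptions, reads the tags from a file with one tag per line. The file names `tags.txt` and `demo.tbl`, the miniature table, and the `n` variable are all hypothetical; `n=2` keeps the toy input short, whereas the real case would use `n=8`:

```shell
workdir=$(mktemp -d)

# Hypothetical tag list, one locus tag per line
printf 'X2\n' > "$workdir/tags.txt"

# Hypothetical miniature table: two genes with two feature lines each
printf 'gene\n\tlocus_tag\tX1\nmRNA\nCDS\ngene\n\tlocus_tag\tX2\nmRNA\nCDS\ngene\n' \
    > "$workdir/demo.tbl"

# NR == FNR holds only while the first file (the tag list) is being read.
# After that, the rules mirror tst.awk, with the countdown length taken from n.
out=$(awk -v n=2 '
    NR == FNR { tagsA[$1]; next }
    c && c--  { $0 = "#" $0 }
    NF > 1 && $(NF-1) == "locus_tag" && ($NF in tagsA) { c = n }
    { print }
' "$workdir/tags.txt" "$workdir/demo.tbl")

printf '%s\n' "$out"
rm -r "$workdir"
```

Only the two lines after the `X2` locus_tag line are commented; the `X1` gene is left untouched. The `NR == FNR` test assumes the tag file is not empty.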

Answer 3: (score: 0)

This might work for you (GNU sed):

sed 's/.*/\\#locus_tag\\s*&#,+8{\\#locus_tag\\s*&#n;s|^|#|}/' tag_file |
sed -i -f - file

This builds a sed script from the tag file and prepends # to the eight lines following each matching tag line.
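
Before editing in place, it may help to look at the generated script and preview its effect without `-i`. The sketch below assumes GNU sed and uses a hypothetical one-tag `tag_file`, a toy `file`, and an intermediate `comment.sed`. In this toy run an offset of `+2` comments exactly the two lines after the tag line, because `n` prints the tag line and moves on to the first of them:

```shell
workdir=$(mktemp -d)
printf 'X2\n' > "$workdir/tag_file"                                  # hypothetical tag list
printf 'locus_tag X1\na\nlocus_tag X2\nb\nc\nd\n' > "$workdir/file"  # toy input

# Step 1: turn each tag into one sed command and save the script for inspection.
sed 's/.*/\\#locus_tag\\s*&#,+2{\\#locus_tag\\s*&#n;s|^|#|}/' "$workdir/tag_file" \
    > "$workdir/comment.sed"
cat "$workdir/comment.sed"

# Step 2: preview the result on standard output; the input file is not modified.
preview=$(sed -f "$workdir/comment.sed" "$workdir/file")
printf '%s\n' "$preview"
```

Once the preview looks right, the same script can be applied with `sed -i -f`.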