Question

Here is the line I want to split into tab-separated parts.

>VFG000676(gb|AAD32411)_(lef)_anthrax_toxin_lethal_factor_precursor_[Anthrax_toxin_(VF0142)]_[Bacillus_anthracis_str._Sterne]

The output I want is

>VFG000676\t(gb|AAD32411)\t(lef)\tanthrax_toxin_lethal_factor_precursor\t [Anthrax_toxin_(VF0142)]\t[Bacillus_anthracis_str._Sterne]

I used this command

grep '>' x.fa | sed 's/^>\(.*\) (gi.*) \(.*\) \[\(.*\)\].*/\1\t\2\t\3/' | sed 's/ /_/g' > output.tsv

But the output is not what I want.

Update: I eventually fixed the problem with the following command

grep '>' VFs_no_block.fa | sed 's/^>\(.*\)\((.*)\) \((.*)\) \(.*\) \(\[.*(.*)]\) \(\[.*]\).*/\1\t\2\t\3\t\4\t\5\t\6/' | sed 's/ /_/g' > VFDB_annotation_reference.tsv

Answer 1

If you actually want literal tab characters rather than the two-character string \t, change OFS="\\t" to OFS="\t":

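$ cat tst.awk
BEGIN { OFS="\\t" }    # output separator: a backslash followed by "t"
{
    c=0
    # Repeatedly take the leftmost [...] group, (...) group, or run of non-bracket text.
    while ( match($0,/\[[^][]+\]|\([^)(]+\)|[^][)(]+/) ) {
        tgt = substr($0,RSTART,RLENGTH)
        gsub(/^_+|_+$/,"",tgt)    # trim the underscores that joined the fields
        if (tgt != "") {
            printf "%s%s", (c++ ? OFS : ""), tgt
        }
        $0 = substr($0,RSTART+RLENGTH)    # continue with the rest of the line
    }
    print    # end the line; prints any unmatched remainder (normally nothing)
}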

$ awk -f tst.awk file
>VFG000676\t(gb|AAD32411)\t(lef)\tanthrax_toxin_lethal_factor_precursor\t[Anthrax_toxin_(VF0142)]\t[Bacillus_anthracis_str._Sterne]
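To get from a whole FASTA file to a TSV in one step, as the commands in the question do, the same awk logic can be restricted to header lines. The following is only a sketch: the function name headers_to_tsv is made up for illustration, it uses a real tab as the separator, and it assumes the fields are joined by underscores as in the sample line (headers that use spaces would need the gsub pattern adjusted).

```shell
# Hypothetical helper (name is illustrative, not from the original answer):
# print each FASTA header line of file "$1" as tab-separated fields.
headers_to_tsv() {
    awk '
    BEGIN { OFS = "\t" }    # a real tab character as the separator
    /^>/ {                  # act on header lines only; sequence lines are skipped
        c = 0
        while ( match($0, /\[[^][]+\]|\([^)(]+\)|[^][)(]+/) ) {
            tgt = substr($0, RSTART, RLENGTH)
            gsub(/^_+|_+$/, "", tgt)    # assumes underscores join the fields
            if (tgt != "") {
                printf "%s%s", (c++ ? OFS : ""), tgt
            }
            $0 = substr($0, RSTART + RLENGTH)
        }
        print
    }' "$1"
}
```

With the file names from the question, this would be called as headers_to_tsv x.fa > output.tsv.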
