Question

I have a file containing thousands of tab-separated lines like the following:

cluster11586    TRINITY_DN135758_c4_g1_i1   5'-adenylylsulfate reductase-like 4    9.10921
cluster41208    TRINITY_DN130890_c2_g1_i1   Anthranilate phosphoribosyltransferase, chloroplastic   18.5398
cluster26862    TRINITY_DN132510_c1_g1_i2   ATP synthase subunit alpha, mitochondrial   4.82626
cluster13001    TRINITY_DN130890_c4_g1_i3   Phosphopantetheine adenylyltransferase  2.58108

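I want to use grep / awk / sed to generate a file containing the text that comes after the first two columns and before the final decimal number, with the tabs removed and the spaces replaced by underscores: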

5'-adenylylsulfate_reductase-like_4
Anthranilate_phosphoribosyltransferase,_chloroplastic
ATP_synthase_subunit_alpha,_mitochondrial
Phosphopantetheine_adenylyltransferase

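My idea was to extract everything before the final decimal number, which I can match with [0-9]+\.[0-9]+$, then pipe the result into something like awk '{$1=$2=""; print $0}' to remove the first two columns (and, hopefully, the tabs that follow them), and then send that to sed -e 's/ /_/g'. But how do I extract the text before the last decimal number on each line without also getting the number itself or the whitespace in front of it? Also, awk seems to leave the tabs behind after the first two columns are removed. Can I do all of this without writing intermediate files?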

Answer 1

Understanding this will give you a good idea of how awk uses fields and field separators to split and recombine records:

$ awk '{$1=$2=$NF=""; $0=$0; OFS="_"; $1=$1; OFS=FS} 1' file
5'-adenylylsulfate_reductase-like_4
Anthranilate_phosphoribosyltransferase,_chloroplastic
ATP_synthase_subunit_alpha,_mitochondrial
Phosphopantetheine_adenylyltransferase

Step by step:

$ awk '{$1=$2=$NF=""; print "<" $0 ":" $1 ">"}' file
<  5'-adenylylsulfate reductase-like 4 :>
<  Anthranilate phosphoribosyltransferase, chloroplastic :>
<  ATP synthase subunit alpha, mitochondrial :>
<  Phosphopantetheine adenylyltransferase :>

$ awk '{$1=$2=$NF=""; $0=$0; print "<" $0 ":" $1 ">"}' file
<  5'-adenylylsulfate reductase-like 4 :5'-adenylylsulfate>
<  Anthranilate phosphoribosyltransferase, chloroplastic :Anthranilate>
<  ATP synthase subunit alpha, mitochondrial :ATP>
<  Phosphopantetheine adenylyltransferase :Phosphopantetheine>

$ awk '{$1=$2=$NF=""; $0=$0; $1=$1; print "<" $0 ":" $1 ">"}' file
<5'-adenylylsulfate reductase-like 4:5'-adenylylsulfate>
<Anthranilate phosphoribosyltransferase, chloroplastic:Anthranilate>
<ATP synthase subunit alpha, mitochondrial:ATP>
<Phosphopantetheine adenylyltransferase:Phosphopantetheine>

$ awk '{$1=$2=$NF=""; $0=$0; OFS="_"; $1=$1; OFS=FS; print "<" $0 ":" $1 ">"}' file
<5'-adenylylsulfate_reductase-like_4:5'-adenylylsulfate>
<Anthranilate_phosphoribosyltransferase,_chloroplastic:Anthranilate>
<ATP_synthase_subunit_alpha,_mitochondrial:ATP>
<Phosphopantetheine_adenylyltransferase:Phosphopantetheine>
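
The steps above can also be written out as a commented script. This is only a sketch: it assumes awk's default whitespace field splitting, and the helper names sample_rows and describe are made up for the illustration (they stand in for the real file).

```shell
# Hypothetical sample data: two of the rows from the question, tab-separated.
sample_rows() {
  printf "cluster11586\tTRINITY_DN135758_c4_g1_i1\t5'-adenylylsulfate reductase-like 4\t9.10921\n"
  printf "cluster13001\tTRINITY_DN130890_c4_g1_i3\tPhosphopantetheine adenylyltransferase\t2.58108\n"
}

# Hypothetical wrapper around the one-liner, one statement per line.
describe() {
  awk '{
    $1 = $2 = $NF = ""   # blank first two fields and the last; $0 is rebuilt with OFS (a space)
    $0 = $0              # re-split the rebuilt record so the empty fields disappear
    OFS = "_"            # join fields with underscores from now on
    $1 = $1              # assign to a field to force $0 to be rebuilt with the new OFS
    OFS = FS             # restore OFS before the next record is processed
    print
  }'
}

sample_rows | describe
```

Because the default FS splits on any run of blanks or tabs, the tabs never need special handling here; the price is that a description ending in a standalone number relies on $NF always being the decimal column.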

Answer 2

Remove the first two (tab-free string, tab) pairs,
remember the next part, which does not end in a digit,
and match the decimal number.

sed -r 's/([^\t]*\t){2}(.*[^0-9])[0-9]*[.][0-9]*$/\2/' file

Then add two simple substitutions to replace the spaces with underscores and delete the leftover tab:

sed -r 's/([^\t]*\t){2}(.*[^0-9])[0-9]*[.][0-9]*$/\2/;s/ /_/g;s/\t//g' file
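
As far as I know, -r and \t inside the pattern are GNU sed conveniences, so the command above may not behave the same with other sed implementations. A sketch of a more portable spelling, assuming a sed that accepts -E and using a literal tab held in a shell variable (sample_rows and describe_sed are hypothetical helper names):

```shell
# Hypothetical sample data: two of the rows from the question, tab-separated.
sample_rows() {
  printf "cluster11586\tTRINITY_DN135758_c4_g1_i1\t5'-adenylylsulfate reductase-like 4\t9.10921\n"
  printf "cluster41208\tTRINITY_DN130890_c2_g1_i1\tAnthranilate phosphoribosyltransferase, chloroplastic\t18.5398\n"
}

describe_sed() {
  tab=$(printf '\t')   # a literal tab character, so the pattern does not depend on \t
  # 1) drop the first two columns and the trailing decimal number, keeping group 2
  # 2) turn spaces into underscores
  # 3) delete the tab that group 2 still ends with
  sed -E "s/([^${tab}]*${tab}){2}(.*[^0-9])[0-9]*[.][0-9]*\$/\\2/; s/ /_/g; s/${tab}//g"
}

sample_rows | describe_sed
```

The third substitution is needed because (.*[^0-9]) captures up to and including the tab in front of the number.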

Answer 3

You can do it like this:

$ cut -d $'\t' -f 3- file | 
  sed -nE 's/^(.*)[[:space:]][[:digit:]][[:digit:]]*\.[[:digit:]][[:digit:]]*/\1/; s/[[:space:]]*$//; s/[[:space:]]/_/gp'
5'-adenylylsulfate_reductase-like_4
Anthranilate_phosphoribosyltransferase,_chloroplastic
ATP_synthase_subunit_alpha,_mitochondrial
Phosphopantetheine_adenylyltransferase

Since the final decimal number is tab-delimited as well, you can rely more on cut to find the field and use sed only to change ' ' to _:

$ cut -d $'\t' -f 3- file | cut -d $'\t' -f 1 | sed -E 's/[[:space:]]/_/g'
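
Since cut already uses a tab as its default delimiter, the same idea can be shortened further. A sketch, assuming the description is always exactly the third tab-separated field and only plain spaces need replacing; it swaps sed for tr, and sample_rows and describe_cut are hypothetical helper names:

```shell
# Hypothetical sample data: two of the rows from the question, tab-separated.
sample_rows() {
  printf "cluster26862\tTRINITY_DN132510_c1_g1_i2\tATP synthase subunit alpha, mitochondrial\t4.82626\n"
  printf "cluster13001\tTRINITY_DN130890_c4_g1_i3\tPhosphopantetheine adenylyltransferase\t2.58108\n"
}

describe_cut() {
  # cut -f 3 keeps only the third tab-separated field;
  # tr then maps every space to an underscore.
  cut -f 3 | tr ' ' '_'
}

sample_rows | describe_cut
```

With a real file, the equivalent would be cut -f 3 file | tr ' ' '_'.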
