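Remove characters matching a pattern from tab-separated files

Time: 2021-03-24 08:43:55

Tags: unix awk sed programming-pearls

I have tab-separated files whose first column follows a pattern, for example: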

NODE_1_length_59711_cov_84.026979_g0_i0_1 12.8
NODE_1_length_59711_cov_84.026979_g0_i0_2 18.9
NODE_2_length_59711_cov_84.026979_g0_i0_1 14.3
NODE_2_length_59711_cov_84.026979_g0_i0_2 16.1
NODE_165433_length_59711_cov_84.026979_g0_i0_1 29

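I want to remove the "NODE_" prefix and everything from "_length" through to the last "_", keeping only the node number and the final index. That way I can get output like this from multiple files: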

1_1 12.8
1_2 18.9
2_1 14.3
2_2 16.1
165433_1 29

2 answers:

Answer 0: (score: 2)

With sed -E, capture the node number and the trailing index, and drop everything between them:

echo 'NODE_165433_length_59711_cov_84.026979_g0_i0_1' | sed -E 's/^NODE_([0-9]+)_.*_([0-9]+)/\1_\2/'

Output:

165433_1
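The question asks about whole tab-separated files, and several of them, so here is a minimal sketch of the same substitution run in a loop. The scratch directory, the a.tsv and b.tsv names, and the .clean suffix are assumptions made for illustration only. Because the greedy .* backtracks to the last "_" that is followed by digits, the tab and the second column are left untouched.

```shell
# Hypothetical layout: a scratch directory holding two tab-delimited inputs.
dir=$(mktemp -d)
printf 'NODE_1_length_59711_cov_84.026979_g0_i0_1\t12.8\n' > "$dir/a.tsv"
printf 'NODE_165433_length_59711_cov_84.026979_g0_i0_1\t29\n' > "$dir/b.tsv"

# Rewrite each input into a sibling *.clean file; the originals are kept.
for f in "$dir"/*.tsv; do
    sed -E 's/^NODE_([0-9]+)_.*_([0-9]+)/\1_\2/' "$f" > "$f.clean"
done

# Show the cleaned files.
cat "$dir/a.tsv.clean" "$dir/b.tsv.clean"
```

Writing to a separate output file rather than editing in place keeps the original data available if the pattern turns out not to fit every line.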

Answer 1: (score: 1)

Using GNU awk:

awk -F "\t" '{ fld1=gensub(/(^NODE_)([[:digit:]]+)(.*_)([[:digit:]]+$)/,"\\2_\\4","g",$1); print fld1"\t"$2 }' file

Explanation:

awk -F "\t" '{   # -F "\t" sets the input field separator to tab
    # Split the first field into four parenthesized groups, then replace it
    # with the second group, a "_", and the fourth group; store the result in fld1.
    fld1=gensub(/(^NODE_)([[:digit:]]+)(.*_)([[:digit:]]+$)/,"\\2_\\4","g",$1);
    print fld1"\t"$2   # Print fld1, a tab, and then the second field.
}' file
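gensub is a GNU awk extension, so the command above will not run under other awk implementations. As a rough sketch of the same idea using only sub(), which is available in POSIX awk, the following assumes every first field has the NODE_<number>_..._<number> shape shown in the question. The sample line is made up so that it has a two-digit trailing index.

```shell
# Sample input is piped in; with real data, pass the file name to awk instead.
result=$(printf 'NODE_2_length_59711_cov_84.026979_g0_i0_12\t16.1\n' |
    awk 'BEGIN { FS = OFS = "\t" }      # read and write tab-separated fields
         {
             sub(/^NODE_/, "", $1)      # drop the leading NODE_ prefix
             sub(/_.*_/, "_", $1)       # collapse first "_" through last "_" into one "_"
             print                      # $0 was rebuilt with OFS after $1 changed
         }')
printf '%s\n' "$result"
```

Since only the first field is modified, any further columns are carried through unchanged.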