I have a large table (millions of rows, a few hundred columns, tab-separated). The first three columns look like this:
GT:DS:GP 0|0:0.181:0.827,0.165,0.008 0|0:0.181:0.827,0.165,0.008 0|0:0.181:0.827,0.165,0.008
GT:DS:GP 0|0:0.109:0.894,0.103,0.003 0|0:0.109:0.894,0.103,0.003 0|0:0.109:0.894,0.103,0.003
GT:DS:GP 0|0:0.004:0.996,0.004,0.000 0|0:0.004:0.996,0.004,0.000 0|0:0.004:0.996,0.004,0.000
GT:DS:GP 0|0:0.117:0.886,0.110,0.003 0|0:0.117:0.886,0.110,0.003 0|0:0.117:0.886,0.110,0.003
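All the remaining columns look like columns 2 and 3. I need a new file based on the first one, in which every field has nothing left after its second colon (:). The output should look like this: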
GT:DS 0|0:0.181 0|0:0.181 0|0:0.181
GT:DS 0|0:0.109 0|0:0.109 0|0:0.109
GT:DS 0|0:0.004 0|0:0.004 0|0:0.004
GT:DS 0|0:0.117 0|0:0.117 0|0:0.117
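I think this may be somewhat similar to what I found in this post, but the exit command apparently tells awk to stop after the first occurrence, so it does not work for multiple occurrences (across several rows/columns)...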
awk -v RS=':' -v ORS=':' 'NR==1{print} NR==2{print; printf"\n";exit}' input > output
The output of this failed attempt is:
GT:DS:
Thanks for your help!
Answer 0 (score: 3)
$ sed 's/\([^:]*:[^:]*\):[^:\t]*/\1/g' file
GT:DS 0|0:0.181 0|0:0.181 0|0:0.181
GT:DS 0|0:0.109 0|0:0.109 0|0:0.109
GT:DS 0|0:0.004 0|0:0.004 0|0:0.004
GT:DS 0|0:0.117 0|0:0.117 0|0:0.117
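As far as I know, `\t` inside a bracket expression is only read as a tab by GNU sed, so the command above may behave differently with other sed implementations. In case that matters, here is a minimal awk sketch of the same idea: split each tab-separated field on ":" and keep only the first two parts. The function name `trim_fields` is just a hypothetical wrapper for illustration, and the sketch assumes the file really is tab-delimited as described in the question.

```shell
# Hypothetical helper (name is illustrative): for every tab-separated field,
# keep only the text before the second colon.
trim_fields() {
    awk 'BEGIN { FS = OFS = "\t" }
    {
        for (i = 1; i <= NF; i++) {
            # split() returns the number of colon-separated parts
            n = split($i, parts, ":")
            # fields with two or fewer parts are left untouched
            if (n > 2) $i = parts[1] ":" parts[2]
        }
        print
    }' "$@"
}

# Quick check on one sample row (tabs written as \t):
printf 'GT:DS:GP\t0|0:0.181:0.827,0.165,0.008\n' | trim_fields
```

For the real file this would be run as `trim_fields input > output`. It loops over every field of every row, so on millions of rows it may be slower than the single sed substitution, but it does not depend on how sed treats `\t`.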