Remove part of a string (but not the whole line) in a subset of lines?

Time: 2016-05-04 17:34:02

Tags: shell unix awk sed

I have a tab-delimited text file with 4 columns and 100 million lines, like this:

chr1    10019   10020   rs775809821
chr2    10108   10109   rs376007522
chr3    10128   10128   rs796688738
chr4    10128   10128   rs796688738
chr5    10138   10139   rs368469931
chr6    10146   10147   rs779258992
chr7    10165   10165   rs796884232
chr8_KI270718v1_random  10149   10150   rs371194064
chr9_GL000221v1_random  10144   10145   rs144773400
chr10_KI270879v1_alt    10055   10055   rs768019142
chr11_KI270714v1_random 10107   10108   rs62651026

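In the lines where the first column contains a "_", I want to remove the part of that column starting at the "_". So I would like the output to look like this: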

chr1    10019   10020   rs775809821
chr2    10108   10109   rs376007522
chr3    10128   10128   rs796688738
chr4    10128   10128   rs796688738
chr5    10138   10139   rs368469931
chr6    10146   10147   rs779258992
chr7    10165   10165   rs796884232
chr8    10149   10150   rs371194064
chr9    10144   10145   rs144773400
chr10   10055   10055   rs768019142
chr11   10107   10108   rs62651026

I tried doing this with sed (sed 's/_\S*\s*/ /' infile > outfile), but this only removed the "_" in the lines containing the string I want to remove. So it looks like this:

chr1    10019   10020   rs775809821
chr2    10108   10109   rs376007522
chr3    10128   10128   rs796688738
chr4    10128   10128   rs796688738
chr5    10138   10139   rs368469931
chr6    10146   10147   rs779258992
chr7    10165   10165   rs796884232
chr8 KI270718v1_random  10149   10150   rs371194064
chr9 GL000221v1_random  10144   10145   rs144773400
chr10 KI270879v1_alt    10055   10055   rs768019142
chr11 KI270714v1_random 10107   10108   rs62651026

How can I remove only the part of the line from the "_" onward, and only in lines where column 1 contains a string after "chr#"?

3 answers:

Answer 0: (score: 1)

You can use:

awk 'BEGIN{FS=OFS="\t"} $1 ~ /chr/{sub(/_.*$/, "", $1)} 1' file

Output:

chr1   10019  10020  rs775809821
chr2   10108  10109  rs376007522
chr3   10128  10128  rs796688738
chr4   10128  10128  rs796688738
chr5   10138  10139  rs368469931
chr6   10146  10147  rs779258992
chr7   10165  10165  rs796884232
chr8   10149  10150  rs371194064
chr9   10144  10145  rs144773400
chr10  10055  10055  rs768019142
chr11  10107  10108  rs62651026
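For readers less familiar with awk, here is the same program spelled out with comments. This is only a sketch: the function name `strip_suffix_awk` is made up for illustration, and the logic is intended to match the one-liner above.

```shell
# Hypothetical wrapper name; the awk program mirrors the one-liner above.
strip_suffix_awk() {
  awk '
    # Split input on tabs and rejoin output with tabs. OFS matters because
    # modifying $1 makes awk rebuild the whole line using OFS.
    BEGIN { FS = OFS = "\t" }

    # Only touch lines whose first column contains "chr".
    $1 ~ /chr/ {
      # Delete everything from the first "_" to the end of column 1.
      # If column 1 has no "_", sub() changes nothing.
      sub(/_.*$/, "", $1)
    }

    # Print every line, modified or not (same effect as the trailing 1).
    { print }
  ' "$@"
}
```

With the file names from the question, it could be run as `strip_suffix_awk infile > outfile`.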

Answer 1: (score: 0)

You can try this:

sed -r 's/_\S+//' file

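Note that this is not restricted to the first column: it removes the first "_" followed by non-whitespace characters, wherever that occurs on the line.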
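To illustrate that caveat, here is a small sketch. The second sample line, with the value `rs_example` in column 4, is invented for the demonstration and does not come from the question; the function name is also hypothetical. It assumes GNU sed, since `-r` and `\S` are used as in the command above.

```shell
# Hypothetical helper wrapping the sed command from this answer (GNU sed assumed).
strip_first_underscore() {
  sed -r 's/_\S+//'
}

# Line from the question: the "_..." suffix in column 1 is removed as intended.
printf 'chr8_KI270718v1_random\t10149\t10150\trs371194064\n' | strip_first_underscore

# Invented line: column 1 has no "_", so the match falls on column 4 and
# "rs_example" is cut down to "rs".
printf 'chr1\t10019\t10020\trs_example\n' | strip_first_underscore
```

For the sample data in the question this makes no difference, because only the first column contains underscores.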

Answer 2: (score: 0)

$ sed -r 's/^([^\t_]+)_[^\t]+/\1/' file
chr1    10019   10020   rs775809821
chr2    10108   10109   rs376007522
chr3    10128   10128   rs796688738
chr4    10128   10128   rs796688738
chr5    10138   10139   rs368469931
chr6    10146   10147   rs779258992
chr7    10165   10165   rs796884232
chr8    10149   10150   rs371194064
chr9    10144   10145   rs144773400
chr10   10055   10055   rs768019142
chr11   10107   10108   rs62651026
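The command above anchors the match at the start of the line, so only the first column can change. As far as I know, `-r` and `\t` inside a bracket expression are GNU sed features; on other sed implementations, a variant along these lines might work instead. This is a sketch, not a tested replacement on every platform: the function name `strip_chr_suffix` and the `tab` variable are made up for illustration.

```shell
# A literal tab character, since not every sed understands \t in brackets.
tab=$(printf '\t')

# Hypothetical helper: same idea as the command above, written with
# POSIX basic regular expressions (\( \) for groups, \{1,\} for "one or more").
strip_chr_suffix() {
  sed "s/^\([^${tab}_]\{1,\}\)_[^${tab}]\{1,\}/\1/"
}

# Column 1 loses its "_..." suffix.
printf 'chr8_KI270718v1_random\t10149\t10150\trs371194064\n' | strip_chr_suffix

# Invented line with "_" only in column 4: left unchanged.
printf 'chr1\t10019\t10020\trs_example\n' | strip_chr_suffix
```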