I have a tab-delimited text file with 4 columns and 100 million rows, like this:
chr1 10019 10020 rs775809821
chr2 10108 10109 rs376007522
chr3 10128 10128 rs796688738
chr4 10128 10128 rs796688738
chr5 10138 10139 rs368469931
chr6 10146 10147 rs779258992
chr7 10165 10165 rs796884232
chr8_KI270718v1_random 10149 10150 rs371194064
chr9_GL000221v1_random 10144 10145 rs144773400
chr10_KI270879v1_alt 10055 10055 rs768019142
chr11_KI270714v1_random 10107 10108 rs62651026
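In rows where the first column contains a "_", I want to remove everything in that column from the "_" onward. So I want the output to look like this: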
chr1 10019 10020 rs775809821
chr2 10108 10109 rs376007522
chr3 10128 10128 rs796688738
chr4 10128 10128 rs796688738
chr5 10138 10139 rs368469931
chr6 10146 10147 rs779258992
chr7 10165 10165 rs796884232
chr8 10149 10150 rs371194064
chr9 10144 10145 rs144773400
chr10 10055 10055 rs768019142
chr11 10107 10108 rs62651026
I tried doing this with sed:
sed 's/_\S*\s*/ /' infile > outfile
but that only removed the "_" itself on the lines containing the string I want to delete, so the result looks like this:
chr1 10019 10020 rs775809821
chr2 10108 10109 rs376007522
chr3 10128 10128 rs796688738
chr4 10128 10128 rs796688738
chr5 10138 10139 rs368469931
chr6 10146 10147 rs779258992
chr7 10165 10165 rs796884232
chr8 KI270718v1_random 10149 10150 rs371194064
chr9 GL000221v1_random 10144 10145 rs144773400
chr10 KI270879v1_alt 10055 10055 rs768019142
chr11 KI270714v1_random 10107 10108 rs62651026
How can I remove the whole part starting at "_" on those lines, that is, only the string that follows "chr#" in column 1?
Answer 0 (score: 1)
You can use:
awk 'BEGIN{FS=OFS="\t"} $1 ~ /chr/{sub(/_.*$/, "", $1)} 1' file
Output:
chr1 10019 10020 rs775809821
chr2 10108 10109 rs376007522
chr3 10128 10128 rs796688738
chr4 10128 10128 rs796688738
chr5 10138 10139 rs368469931
chr6 10146 10147 rs779258992
chr7 10165 10165 rs796884232
chr8 10149 10150 rs371194064
chr9 10144 10145 rs144773400
chr10 10055 10055 rs768019142
chr11 10107 10108 rs62651026
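Before running this over all 100 million rows, it may be worth checking the command on a couple of lines written with real tab characters. The sketch below wraps the awk program from this answer in a shell function; the name strip_contig_suffix is made up for illustration, and the two sample rows are copied from the question.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper name; the awk program is the one from this answer.
# It reads stdin and writes stdout, so it streams the file line by line.
strip_contig_suffix() {
  awk 'BEGIN{FS=OFS="\t"} $1 ~ /chr/{sub(/_.*$/, "", $1)} 1'
}

# Two rows from the question, built with printf so the separators are real tabs.
printf 'chr7\t10165\t10165\trs796884232\nchr8_KI270718v1_random\t10149\t10150\trs371194064\n' |
  strip_contig_suffix
```

With the file names used in the question, the same function could be applied as `strip_contig_suffix < infile > outfile`.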
Answer 1 (score: 0)
You can try this:
sed -r 's/_\S+//' file
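Note that it is not restricted to the first column: it removes the first "_" on a line, together with the non-whitespace characters that follow it, wherever that "_" occurs.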
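To make that caveat concrete, here is a small sketch. It assumes GNU sed (for -r and \S), and the fourth field of the sample row is invented purely to show the effect; the question's data does not contain such a value. The function name is also hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the sed command from this answer (GNU sed assumed).
strip_first_underscore_run() {
  sed -r 's/_\S+//'
}

# Invented row: no "_" in column 1, but one in column 4.
# The substitution still fires and truncates column 4.
printf 'chr1\t10019\t10020\tid_with_underscore\n' | strip_first_underscore_run
```

If no column other than the first can ever contain "_", as in the sample data shown in the question, this makes no difference in practice.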
Answer 2 (score: 0)
$ sed -r 's/^([^\t_]+)_[^\t]+/\1/' file
chr1 10019 10020 rs775809821
chr2 10108 10109 rs376007522
chr3 10128 10128 rs796688738
chr4 10128 10128 rs796688738
chr5 10138 10139 rs368469931
chr6 10146 10147 rs779258992
chr7 10165 10165 rs796884232
chr8 10149 10150 rs371194064
chr9 10144 10145 rs144773400
chr10 10055 10055 rs768019142
chr11 10107 10108 rs62651026
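As a reading of the expression above: `^([^\t_]+)` captures the leading run of characters that are neither a tab nor "_", and `_[^\t]+` then matches the "_" and the rest of the first field, so only column 1 can be changed. The sketch below checks that on two rows. It assumes GNU sed, which accepts `\t` inside a bracket expression; the function name and the fourth field of the second row are invented for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the sed command from this answer (GNU sed assumed).
strip_first_field_suffix() {
  sed -r 's/^([^\t_]+)_[^\t]+/\1/'
}

# Row 1 is from the question; row 2 has an invented "_" in column 4 only,
# which the anchored pattern should leave untouched.
printf 'chr11_KI270714v1_random\t10107\t10108\trs62651026\nchr1\t10019\t10020\tid_with_underscore\n' |
  strip_first_field_suffix
```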