I want to append a string to the sequence headers in a FASTA file.
Input:
>uce-101_seqname
GGCTGGCACCAGTTAACTTGGGATATTGGAGTGAAAAGGCCCGTAATCAGCCTTCGGTCATGTAGAACAATGCATAAAATTAAATTGACATTAATGAATAATTGTGTAATGAAAATGGA
Desired output:
>uce-101_seqname |uce-101
GGCTGGCACCAGTTAACTTGGGATATTGGAGTGAAAAGGCCCGTAATCAGCCTTCGGTCATGTAGAACAATGCATAAAATTAAATTGACATTAATGAATAATTGTGTAATGAAAATGGA
My current code:
awk -F ">" '{if($2 ~ /^uce/){print $0 " |" substr($2,1,7)} else {print $0}}' <inputfile>
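This code only works when the ID is exactly 7 characters long (e.g., uce-101). I need it to work for IDs that are shorter or longer than 7 characters (e.g., uce-1, uce-10, uce-1001).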
Answer 0 (score: 3)
I think shellter hit the nail on the head in the comments above. With that approach, your awk line can be reduced to:
awk -F '>' '$2~/^uce/ { x=$2; sub(/_.*/,"",x); print $0, "|" x; next }1' file
Result:
>uce-101_seqname |uce-101
GGCTGGCACCAGTTAACTTGGGATATTGGAGTGAAAAGGCCCGTAATCAGCCTTCGGTCATGTAGAACAATGCATAAAATTAAATTGACATTAATGAATAATTGTGTAATGAAAATGGA
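To check that this handles IDs of different lengths, here is a minimal sketch that wraps the same awk program in a shell function and feeds it headers with 5-, 6-, and 8-character IDs. The function name append_id and the short ACGT sequences are made up for illustration and are not part of the original data.

```shell
# Hypothetical helper wrapping the awk one-liner above; reads FASTA on stdin.
append_id() {
  # Copy the header text into x, strip everything from the first "_" onward,
  # then print the unchanged line followed by " |" and the stripped ID.
  awk -F '>' '$2 ~ /^uce/ { x = $2; sub(/_.*/, "", x); print $0, "|" x; next } 1'
}

# Made-up test records: IDs of 5, 6, and 8 characters.
printf '>uce-1_seqname\nACGT\n>uce-10_seqname\nACGT\n>uce-1001_seqname\nACGT\n' | append_id
```

Because sub() works on the copy x rather than on $2, the original header is printed untouched and only the appended part is shortened.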
However, if you prefer a sed solution, you could try:
sed '/^>uce/s/>\([^_]*\).*/& |\1/' file
Result:
>uce-101_seqname |uce-101
GGCTGGCACCAGTTAACTTGGGATATTGGAGTGAAAAGGCCCGTAATCAGCCTTCGGTCATGTAGAACAATGCATAAAATTAAATTGACATTAATGAATAATTGTGTAATGAAAATGGA
Explanation:
/^>uce/ # This is an address that specifies which lines are to be
# examined or modified. In this case, only lines beginning
# with the string '>uce' are addressed.
s/../../ # Perform a substitution using the '/' delimiter
>\([^_]*\).* # This is the pattern to be matched. The '>' character is a
# literal '>'. Escaped parentheses then capture a character
# class matching any character that is not an underscore,
# zero or more times. This is followed by any character,
# any number of times.
& |\1 # This is the replacement string. The '&' character stands
# for the whole text that matched the pattern. It is followed
# by a literal space and a literal pipe character. '\1' is
# the text captured by the escaped parentheses.
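To see what the capture group picks up, a small sketch can print only '\1' for each matching header instead of appending it. The function name captured_id and the sample records are hypothetical; the address and pattern are the ones explained above, with -n and the p flag added so that only substituted lines are printed.

```shell
# Hypothetical helper: show only the text captured by \( ... \) for each header.
captured_id() {
  # -n suppresses default output; the trailing p prints lines where s/// matched.
  sed -n '/^>uce/s/>\([^_]*\).*/\1/p'
}

# Made-up records with IDs of different lengths.
printf '>uce-1_seqname\nACGT\n>uce-1001_seqname\nACGT\n' | captured_id
```

This prints uce-1 and uce-1001, one per line, which shows that the captured text stops at the first underscore regardless of the ID length.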
Answer 1 (score: 2)
This should do it:
awk -F">|_" 'NF>2 {$0=$0" |"$2}1' file
>uce-101_seqname |uce-101
GGCTGGCACCAGTTAACTTGGGATATTGGAGTGAAAAGGCCCGTAATCAGCCTTCGGTCATGTAGAACAATGCATAAAATTAAATTGACATTAATGAATAATTGTGTAATGAAAATGGA
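-F">|_" sets the field separator to > or _.
NF>2 {$0=$0" |"$2} rebuilds the line with " |" and the second field appended, if the line contains more than two fields.
1 prints every line.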
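As a rough illustration of how that field separator splits the lines, the sketch below prints the field count and the first three fields for a header line and a sequence line. The function name show_fields and the shortened sequence are assumptions made for the example.

```shell
# Hypothetical helper: show how awk splits each line when FS is ">" or "_".
show_fields() {
  awk -F'>|_' '{ print NF ": [" $1 "] [" $2 "] [" $3 "]" }'
}

# One header and one (shortened, made-up) sequence line.
printf '>uce-101_seqname\nGGCTGG\n' | show_fields
```

The header yields three fields (an empty one before ">", then uce-101, then seqname), while the sequence line yields a single field, which is why NF>2 selects only headers. A header with no underscore would have just two fields and would be left unchanged by that test.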
If you need to test for uce, then this should do it:
awk -F">|_" '$2~/^uce/ {$0=$0" |"$2}1' file