I have a FASTA file (testfile.fa) that contains header lines (starting with >) and lines of characters representing nucleotides (A, C, G, T, a, c, g, t, N).
>chr1
cccccccccttttttttaaaa
AAAACCCCTTCCCCCCCCGGG
GGGGGGGGGGGGGGGGGGGGG
TTTTTTTTTTTTTTTTTTTTT
>chr1_alt
TCTCTCTCTCTCTCTCTCTCT
gggtttccccccccccccccc
CGCGCGCGCGCGCGCGCGCGC
CCCCCAAAAAAAAAAAAAAAA
>chr2
CCCCCCCCCCCCCCCCCCCCC
TTTTTTTTTTTTATTTTTTTT
>chr3
AAAAAAAAAAAAAAAAAAAAA
GGGGGGGGGGGGGGGGGGGGG
TTTTTTTTTTTTTTTTTTTTT
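I need to read this file line by line and change the lowercase characters (a, c, t, g) to N in every sequence line, that is, in every line except the headers, which contain >. So I used the following code: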
#!/bin/bash
while read line
do
if [[ $line =~ ">" ]]
then
echo $line
else
tr 'c' 'N'
echo $line
fi
done < testfile.fa
But the result is confusing:
>chr1
# the first line was missed
AAAACCCCTTCCCCCCCCGGG
GGGGGGGGGGGGGGGGGGGGG
TTTTTTTTTTTTTTTTTTTTT
>Nhr1_alt #the character was changed but the line contains >
TCTCTCTCTCTCTCTCTCTCT
gggtttNNNNNNNNNNNNNNN
CGCGCGCGCGCGCGCGCGCGC
CCCCCAAAAAAAAAAAAAAAA
>Nhr2 #the character was changed but the line contains >
CCCCCCCCCCCCCCCCCCCCC
TTTTTTTTTTTTATTTTTTTT
>Nhr3 #the character was changed but the line contains >
AAAAAAAAAAAAAAAAAAAAA
GGGGGGGGGGGGGGGGGGGGG
TTTTTTTTTTTTTTTTTTTTTcccccccccttttttttaaaa #the first line from the first sequence comes here
What could be causing these problems, and how can I fix them? Thanks in advance!
Answer 0 (score: 1)
Using awk:
$ awk '/^[^>]/{gsub(/[actg]/,"N")}1' file
>chr1
NNNNNNNNNNNNNNNNNNNNN
AAAACCCCTTCCCCCCCCGGG
GGGGGGGGGGGGGGGGGGGGG
TTTTTTTTTTTTTTTTTTTTT
>chr1_alt
TCTCTCTCTCTCTCTCTCTCT
NNNNNNNNNNNNNNNNNNNNN
CGCGCGCGCGCGCGCGCGCGC
CCCCCAAAAAAAAAAAAAAAA
>chr2
CCCCCCCCCCCCCCCCCCCCC
TTTTTTTTTTTTATTTTTTTT
>chr3
AAAAAAAAAAAAAAAAAAAAA
GGGGGGGGGGGGGGGGGGGGG
TTTTTTTTTTTTTTTTTTTTT
Explanation:
/^[^>]/ { # if the record starts with anything but >
gsub(/[actg]/,"N") # replace all actg with N
}1 # output
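Answer 1 (score: 0)
You are using tr the wrong way. Called without any input of its own, tr reads from the loop's standard input, which is the rest of testfile.fa, so it consumes all the remaining lines at once. The line has to be piped into tr instead.
Here is my script: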
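#!/bin/bash
# [[ ... ]] is a bash feature, so run this with bash rather than sh
while IFS= read -r line
do
if [[ $line == ">"* ]]
then
echo "$line"
else
# pipe the line into tr so that tr does not read the rest of the file
echo "$line" | tr 'actg' 'NNNN'
fi
done < t.file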
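The same loop can also be written without tr at all. What follows is only a minimal sketch, assuming bash (the `${var//pattern/replacement}` expansion is not available in plain sh). The function name `mask_lowercase` is made up for illustration, and it reads the FASTA data from standard input.

```shell
#!/bin/bash
# mask_lowercase (hypothetical helper): read FASTA text on stdin,
# print headers unchanged and replace a, c, t, g with N on other lines.
mask_lowercase() {
    local line
    # IFS= and -r keep whitespace and backslashes intact;
    # the || test also handles a last line with no trailing newline.
    while IFS= read -r line || [[ -n $line ]]; do
        if [[ $line == ">"* ]]; then
            printf '%s\n' "$line"
        else
            # bash pattern substitution: every a, c, t or g becomes N
            printf '%s\n' "${line//[actg]/N}"
        fi
    done
}

# Example with inline data instead of a file
printf '>chr1\nccccttaa\nAAAACCgg\n' | mask_lowercase
```

To process the file from the question, run `mask_lowercase < testfile.fa`. Nothing inside the loop other than read touches standard input, so no lines are swallowed.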
Answer 2 (score: 0)
To change all lowercase characters with an awk statement, we can use:
awk '{ if (substr($0,1,1) != ">") { stat="";for ( i=1;i<=length($0);i++ ) { if ( substr($0,i,1) ~ /[[:lower:]]/ ) { stat=stat"N" } else stat=stat substr($0,i,1) } print stat } else { print $0 } }' testfile.fa
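We use awk's substr function and simply print any line that has > as its first character. For the other lines, we build a variable stat in which every lowercase letter is changed to N, and then print the final value of stat.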
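The character-by-character loop above can probably be condensed with awk's gsub function and the same [[:lower:]] class. This is a minimal sketch, assuming an awk that supports POSIX character classes (very old mawk versions may not). The command is wrapped in a hypothetical shell function `mask_lower` so it can be tried on inline data. Note that, unlike [actg], this class also replaces any other lowercase letter, such as n.

```shell
# mask_lower (hypothetical helper): replace every lowercase letter with N
# on lines that do not start with ">"; headers pass through unchanged.
mask_lower() {
    awk '!/^>/ { gsub(/[[:lower:]]/, "N") } { print }' "$@"
}

# Example with inline data; a file name can be passed as an argument instead
printf '>chr1\nccttnAAG\n' | mask_lower
```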
Answer 3 (score: 0)
sed is a simple way to achieve this:
sed -i '/^>/ !s/[actg]/N/g' testfile.fa
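The [] contains the characters that will be changed to N, and the /^>/ ! part skips lines that start with >.
-i overwrites the file in place; without it, the result is written to standard output.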
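Because -i replaces the original file, it may be worth previewing the result first or keeping a backup. This is a minimal sketch, assuming a sed that accepts a suffix attached directly to -i (GNU and BSD sed both do). The file name demo.fa and its contents are made up for illustration.

```shell
# Create a small throwaway FASTA file (hypothetical data)
printf '>chr1\nccttAAG\n' > demo.fa

# Preview: without -i the result goes to standard output, demo.fa is untouched
sed '/^>/!s/[actg]/N/g' demo.fa

# Edit in place, keeping the original as demo.fa.bak
sed -i.bak '/^>/!s/[actg]/N/g' demo.fa
```

Once the edited file looks right, the .bak copy can be deleted.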