How to read a file line by line and change characters unless the line contains a specific value

Time: 2017-07-05 10:18:23

Tags: bash while-loop fasta

I have a FASTA file (testfile.fa) that contains header lines (starting with >) and lines of characters representing nucleotides (A, C, G, T, a, g, c, t, N).

>chr1
cccccccccttttttttaaaa
AAAACCCCTTCCCCCCCCGGG
GGGGGGGGGGGGGGGGGGGGG
TTTTTTTTTTTTTTTTTTTTT
>chr1_alt
TCTCTCTCTCTCTCTCTCTCT
gggtttccccccccccccccc
CGCGCGCGCGCGCGCGCGCGC
CCCCCAAAAAAAAAAAAAAAA
>chr2
CCCCCCCCCCCCCCCCCCCCC
TTTTTTTTTTTTATTTTTTTT
>chr3
AAAAAAAAAAAAAAAAAAAAA
GGGGGGGGGGGGGGGGGGGGG
TTTTTTTTTTTTTTTTTTTTT

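I need to read this file line by line and, in every sequence line (that is, every line except the headers, which contain >), change the lowercase characters (a, c, t, g) to N. So I used the following code: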

#!/bin/bash 
while read line
do
    if [[ $line =~ ">" ]]
    then
        echo $line
    else
        tr 'c' 'N'
        echo $line
    fi
done < testfile.fa

But the result is confusing:

>chr1
# the first line was missed
AAAACCCCTTCCCCCCCCGGG
GGGGGGGGGGGGGGGGGGGGG
TTTTTTTTTTTTTTTTTTTTT
>Nhr1_alt #the character was changed but the line contains >
TCTCTCTCTCTCTCTCTCTCT
gggtttNNNNNNNNNNNNNNN
CGCGCGCGCGCGCGCGCGCGC
CCCCCAAAAAAAAAAAAAAAA
>Nhr2 #the character was changed but the line contains >
CCCCCCCCCCCCCCCCCCCCC
TTTTTTTTTTTTATTTTTTTT
>Nhr3 #the character was changed but the line contains >
AAAAAAAAAAAAAAAAAAAAA
GGGGGGGGGGGGGGGGGGGGG
TTTTTTTTTTTTTTTTTTTTTcccccccccttttttttaaaa #the first line from the first sequence comes here

What are the possible causes of these problems, and how can I fix them? Thanks in advance!

4 answers:

Answer 0 (score: 1)

Using awk:

$ awk '/^[^>]/{gsub(/[actg]/,"N")}1' file
>chr1
NNNNNNNNNNNNNNNNNNNNN
AAAACCCCTTCCCCCCCCGGG
GGGGGGGGGGGGGGGGGGGGG
TTTTTTTTTTTTTTTTTTTTT
>chr1_alt
TCTCTCTCTCTCTCTCTCTCT
NNNNNNNNNNNNNNNNNNNNN
CGCGCGCGCGCGCGCGCGCGC
CCCCCAAAAAAAAAAAAAAAA
>chr2
CCCCCCCCCCCCCCCCCCCCC
TTTTTTTTTTTTATTTTTTTT
>chr3
AAAAAAAAAAAAAAAAAAAAA
GGGGGGGGGGGGGGGGGGGGG
TTTTTTTTTTTTTTTTTTTTT

Explanation:

/^[^>]/ {               # if the record starts with anything but >
    gsub(/[actg]/,"N")  # replace all actg with N
}1                      # output

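Answer 1 (score: 0)

You are using tr the wrong way. tr does not receive the line as an argument; it reads standard input. Called on its own inside the loop, it inherits the loop's input (the redirected file), so after read has taken the first sequence line, tr consumes the entire rest of the file, replaces every c (header lines included) and prints it. The loop then echoes the line it had already read, which is why the first sequence line shows up at the very end. The fix is to pipe each line into tr.

Here is my script: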

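#!/bin/bash
# [[ ... ]] is a bash feature, so use bash rather than sh

while IFS= read -r line
do
    if [[ $line =~ ">" ]]
    then
        # header line: print unchanged
        echo "$line"
    else
        # sequence line: feed it to tr on standard input
        # (the question asks for a, c, t, g, so all four are mapped to N)
        echo "$line" | tr 'actg' 'NNNN'
    fi
done < t.file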
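
The same loop can probably be written without starting tr for every line, by using bash's pattern substitution instead. The following is only a sketch under a few assumptions: the function name mask_lowercase is made up for illustration, it needs bash (not plain sh), and the input is assumed to be a plain-text FASTA file like the one in the question.

```shell
#!/bin/bash
# mask_lowercase is a hypothetical helper name, not part of the original answer.
# It prints the FASTA file given as $1 with a, c, g, t replaced by N
# on every line that is not a header.
mask_lowercase() {
    local line
    # IFS= and -r keep whitespace and backslashes intact;
    # the || test also handles a last line without a trailing newline.
    while IFS= read -r line || [[ -n $line ]]; do
        if [[ $line == '>'* ]]; then
            printf '%s\n' "$line"                # header: print unchanged
        else
            printf '%s\n' "${line//[acgt]/N}"    # sequence: mask a, c, g, t
        fi
    done < "$1"
}
```

It could be called as mask_lowercase testfile.fa > masked.fa. For large files, awk or sed will usually be faster than a shell loop.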

Answer 2 (score: 0)

To change all lowercase characters with a single awk statement, we can use:

awk '{ if (substr($0,1,1) != ">") { stat="";for ( i=1;i<=length($0);i++ ) { if ( substr($0,i,1) ~ /[[:lower:]]/ ) { stat=stat"N" } else stat=stat substr($0,i,1) } print stat } else { print $0 } }' testfile.fa

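We use awk's substr function and simply print any line that has > as its first character. For every other line, we build a variable stat in which each lowercase letter is replaced by N, and then print the final value of stat.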
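
The character-by-character loop can likely be shortened, since awk's gsub accepts the same [[:lower:]] class. The sketch below assumes an awk that supports POSIX character classes (gawk and recent mawk do); the wrapper name mask_lower_awk is only illustrative.

```shell
# mask_lower_awk is a hypothetical wrapper name used only for this sketch.
# $1 is the FASTA file to read; the result goes to standard output.
mask_lower_awk() {
    awk '
        !/^>/ { gsub(/[[:lower:]]/, "N") }   # sequence line: mask every lowercase letter
        { print }                            # print every line, header or not
    ' "$1"
}
```

Note that [[:lower:]] masks every lowercase letter, not just a, c, t and g, so a lowercase n would also become N.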

Answer 3 (score: 0)

sed is a simple way to achieve this:

sed -i '/^>/ !s/[actg]/N/g' testfile.fa

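The [] holds the characters that will be changed to N, and the /^>/ ! part skips lines that start with >.

-i overwrites the current file in place; without it, the result is written to standard output.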
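
The syntax of -i differs between implementations (BSD/macOS sed expects a backup suffix argument after -i, GNU sed does not), so a more portable route is to write to a temporary file and copy the result back. This is a minimal sketch, not the answer's own method; the function name mask_in_place and its variables are made up for illustration.

```shell
# mask_in_place is a hypothetical helper name used only for this sketch.
# It rewrites the FASTA file given as $1 without relying on sed -i.
mask_in_place() {
    tmp=$(mktemp) || return 1
    # Same substitution as the sed command, minus -i: output goes to the temp file.
    # The original is overwritten only if sed succeeded.
    sed '/^>/!s/[actg]/N/g' "$1" > "$tmp" && cat "$tmp" > "$1"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

Calling mask_in_place testfile.fa should leave the headers untouched and mask the lowercase bases. Copying the contents back with cat, rather than moving the temporary file over the original, keeps the original file's permissions.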