使用awk连接线

时间:2015-07-31 19:14:45

标签: awk concatenation

我有包含两个基因序列的fasta文件,我想要做的是删除fasta标题(以&#34开头的行;&gt;&#34;),连接其余的行并输出该序列< / p>

这是我的fasta序列(genome.fa):

>Potrs164783
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
>Potrs164784
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC

期望的输出

AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC

我使用awk执行此操作但我收到此错误

awk 'BEGIN{filename="file1"}{if($1 ~ />/){filename=$1; sub(/>/,"",filename); print filename;} print $0 >filename.fa;}' ../genome.fa

awk: syntax error at source line 1
 context is
    BEGIN{filename="file1"}{if($1 ~ />/){filename=$1; sub(/>/,"",filename); print filename;} print $0 >>>  >filename. <<< fa;}
awk: illegal statement at source line 1

我基本上是一个蟒蛇人,我被某人给了这个剧本。我在这里做错了什么?

我意识到我不清楚,所以我粘贴了我从某人那里得到的整个代码。输入文件和所需输出保持不变

mkdir split_genome;
cd split_genome;
awk 'BEGIN{filename="file1"}{if($1 ~ />/){filename=$1; sub(/>/,"",filename); print filename;} print $0 >filename.fa;}' ../genome.fa;
ls -1 `pwd`/* > ../scaffold_list.txt;
cd ..;

3 个答案:

答案 0 :(得分:1)

The awk code that you show attempts to do something different than produce the output that you want. Fortunately, there are much simpler ways to obtain your desired output. For example:

$ grep -v '>' ../genome.fa
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC

Alternatively, if you had intended to have all non-header lines concatenated into one line:

$ sed -n '/^>/!H; $!d; x; s/\n//gp' ../genome.fa
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGATTGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAACTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAATTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCCGGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC

答案 1 :(得分:1)

如果您只想生成问题中显示的所需输出,则其他解决方案将起作用。

但是,您拥有的脚本是尝试将每个序列打印到使用其标题和扩展名.fa命名的文件。

您获得的语法错误是因为filename.fa既不是变量也不是固定字符串。虽然没有Awk允许您打印到filename.fa,因为它既不是引号也不是变量(可变名称中不能包含.),BSD Awk不允许在操作字符串时它们目前作为GNU Awk的文件名。

所以解决方案:

print $0 > filename".fa"

会在BSD Awk中产生相同的错误,但在GNU Awk中会有效。

要解决此问题,您可以在分配时将附加信息".fa"附加到filename

这将完成这项工作:

$ awk '{if($0 ~ /^>/) filename=substr($0, 2)".fa"; else print $0 > filename}' file
$ cat Potrs164783.fa
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
$ cat Potrs164784.fa
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC

您会注意到我遗漏了BEGIN{filename="file1"}声明声明,因为它是不必要的。此外,我使用字符串函数sub(...)替换了对substr的需求,因为它更清晰,所需操作更少。

答案 2 :(得分:0)

Try this to print lines not started by > and in one line:

awk '!/^>/{printf $0}' genome.fa > filename.fa

With carriage return:

awk '!/^>/' genome.fa > filename.fa

To create single files named by the headers:

awk 'split($0,a,"^>")>1{file=a[2];next}{print >file}' genome.fa