AWK script for searching in a fasta file

Asked: 2014-07-16 10:15:07

Tags: bash awk fasta

I have a fasta file like this:

>gnl|SRA|SRR035294.8571.2 FIHSSUW01ASCWS.2 length=224
GAGATGAAATAGATCTTGGCATATATGTACATGCTTGATCTCAGTTTTGATTGGATTTTATCCATTTTAG
CTATCTTAACTATTAATCTTGAAATGAAGCTTTAATTTATGTAGGAAGTTTATGAAATTTAGGAAAAAAA
AAGAAAAAAACAAAACAATGTCGGCCGCCTCGGTCTCTACTGAGACACGCAACAGGGGATAGGCAAGGCA
CACAGGGGATAGGN
>gnl|SRA|SRR035294.8572.2 FIHSSUW01ETZME.2 length=254
ACTAACCAGGTGGTAAACAACTACTACAGGCCAGATTTGAAGAAGGCTGCTCTTGCTAGATTGAGTGCAG
TGAACAGAAGCCTTAAGGTTTCAAAGTCTGGTGTGAAGAAGAAGAACAGACAGGCAGTTAGGATCCATGG
TAGGAAGTGAAGCTGTGATTTGCCTACCGTCTGATATTCATCGTATCACTTTCTAGCTGTTCCGTCTTGT
TTGGCAAGTGTTTGGTTTTACGTGCGAGTAGTTATATGTTGCGC
>gnl|SRA|SRR035294.8573.2 FIHSSUW01AZA99.2 length=230
AAGCAGTGGTATCAACGCAGAGTGGCCATTACGGCCGGGGATGTACCAATTCAAAAAGAAAACAGCAGTT
GGGGGCAAAACAATTAAGTTGTAACGAATGCATATATATGATTAATCTTCTAACACATTATTTTTGTCTC
AAAAAAAAAGAAAAAAAACAAAACATGTCGGCCGCCTCGGTCTCTACTGAGACACGCAACAGGGGATAGG
CAAGGCACACAGGGGATAGG
>gnl|SRA|SRR035294.8574.2 FIHSSUW01EHI3P.2 length=153
TGCAAGTTTACAACTTAAAACAACTTTTCTCACAGTGAACAATAAATTTATCAATTCTCATGCAAAAAAA
AAGAAAAAAACAAAAACATGTCGGCCGCCTCGGTCTCTACTGAGACACGCAACAGGGGATAGGCAAGGCA
CACAGGGGATAGG
>gnl|SRA|SRR035294.8575.2 FIHSSUW01EWK4S.2 length=287
AACAGTGGTATCAACGCAGAGTGGCCATTACGGCCGGGAGATTACAGGTATTGCAAGTTTCAAGCCTGTC
ATAAAGACTCAAAGCCGCTTGTAATTTGTGTTTCCTAGTTGGGGAAGCTGTTTGTTCTTTATTGTGCTAT
ATGTATTTATTTGAAAGTTTGGATGAACTCAATAAATAAAAGAAAATCTTCATTGTGGGTTACAATTTGG
ACATGAACATGCATGAATAATGTACCAATTTAGCAAAAAAAAAGAAAAAAACAAAAAACAAATAGTCGGC
CGGCCCG
>gnl|SRA|SRR035294.8576.2 FIHSSUW01C911A.2 length=265
TATTCTCAGGTACGAAATATGAGTTTGCTGATAAATTGATGGATTGGGAATCAGCCTGCATAATAAGATA
TTCCCAATTAACTTTGCCCGTTAGTTCTTTTAGCTTTTCCTTTAAAGGCACGAGTCTTTCAACCAAAACA
TTACAGCAAAGTCTAACTGCCTCACAGCTTGCTTCAGAAGTTGTACCCCCGGCCGTAATGGCCACTCTGC
GTTGATACCACTGCTTCTGAGACACGCAACAGGGGATAGGCAAGGCACACAGGGG

I have written this script in bash:

STRING=$1
FILE=$(pwd)"/"$2

if [ -z "$STRING" ] 
then 
    echo "Usage: fastaFind.sh <query> <fasta file>"
else
    echo ""
    awk  'BEGIN { RS = ">" } ; $0 ~ "'$STRING'" { print $0 }' "$FILE"
fi

I am running this command:

 fastaFind.sh "gnl|SRA|SRR035294.8573.2 FIHSSUW01AZA99.2 length=230" file.fasta

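But it returns an error about an unterminated string. What I want is to retrieve the specific sequence for the query after running the command, e.g.: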

>gnl|SRA|SRR035294.8573.2 FIHSSUW01AZA99.2 length=230
AAGCAGTGGTATCAACGCAGAGTGGCCATTACGGCCGGGGATGTACCAATTCAAAAAGAAAACAGCAGTT
GGGGGCAAAACAATTAAGTTGTAACGAATGCATATATATGATTAATCTTCTAACACATTATTTTTGTCTC
AAAAAAAAAGAAAAAAAACAAAACATGTCGGCCGCCTCGGTCTCTACTGAGACACGCAACAGGGGATAGG

3 answers:

Answer 0 (score: 1)

There are several problems to fix.

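  1. Quoting error. Because you are expanding the shell variable STRING inside the awk call, the whole awk command must be enclosed in double quotes, and the double quotes inside the awk command must then be removed. As written, the unquoted $STRING is split at its spaces, so awk receives a truncated program that ends in the middle of a string.
  2. The match operator ~ cannot be used, because the pattern contains characters that have a special meaning in a regular expression (such as |). You therefore need a way to compare part of the input record literally; that is the reason for the $1 comparison (made possible by redefining FS).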
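
To illustrate the difference between a regular-expression match and a literal comparison, here is a minimal sketch. It is not the answerer's original code: it passes the query with awk's -v option instead of double-quoting the program, and the two-record sample file and the variable names (fasta, query, regex_hits, exact_hits) are made up for the demonstration.

```shell
# Hypothetical sample data: two FASTA records written to a temporary file.
fasta=$(mktemp)
printf '>gnl|SRA|read.1 length=4\nACGT\n>gnl|SRA|read.2 length=4\nTTGG\n' > "$fasta"

query='gnl|SRA|read.2 length=4'

# Regex match: "|" means alternation, so the pattern reads as
# "gnl" OR "SRA" OR "read.2 length=4", and both records match.
regex_hits=$(awk -v q="$query" 'BEGIN { RS = ">" } $0 ~ q { n++ } END { print n + 0 }' "$fasta")

# Literal comparison: with FS set to a newline, $1 is the header line,
# so only the record whose header equals the query is counted.
exact_hits=$(awk -v q="$query" 'BEGIN { RS = ">"; FS = "\n" } $1 == q { n++ } END { print n + 0 }' "$fasta")

echo "regex matches: $regex_hits, exact matches: $exact_hits"
rm -f "$fasta"
```

On this sample the regex version counts 2 records and the literal comparison counts 1, which is why comparing $1 is the safer choice for headers that contain | characters.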

Answer 1 (score: 1)

Or simply:

awk -v "RS=>" '/length=254/ { print $0; }' file
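
A small sketch of how this one-liner behaves, assuming a made-up two-record file (the names fasta and result are hypothetical). Because RS is ">", the ">" character itself is not part of $0, so this variant puts it back when printing.

```shell
# Hypothetical sample data.
fasta=$(mktemp)
printf '>read.1 length=4\nACGT\n>read.2 length=6\nTTGGCC\n' > "$fasta"

# Each record is the text between two ">" characters. The record already
# ends with a newline, so printf adds only the leading ">".
result=$(awk -v 'RS=>' '/length=6/ { printf ">%s", $0 }' "$fasta")

echo "$result"
rm -f "$fasta"
```

Note that /length=6/ is a regular expression tested against the whole record, so it would also select a header containing length=60; it is convenient for quick lookups but not an exact match.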

Answer 2 (score: 1)

Your awk command would be better written as:

awk 'BEGIN{ ORS = ""; RS = ">"; FS="\n" } $1 == "pattern" { print ">" $0 }' file

Or

awk -v p="pattern" 'BEGIN {ORS = ""; RS = ">"; FS = "\n" } $1 == p { print ">" $0 }' file

Your shell script would then be:

#!/bin/bash

STRING=$1
FILE=$2

if [[ -z $STRING ]]; then
    echo "Usage: fastaFind.sh <query> <fasta file>"
else
    awk -v p="$STRING" 'BEGIN{ ORS=""; RS=">"; FS="\n" } $1 == p { print ">" $0 }' "$FILE"
fi

Usage example:

bash temp.sh 'gnl|SRA|SRR035294.8575.2 FIHSSUW01EWK4S.2 length=287' temp.txt

Output:

>gnl|SRA|SRR035294.8575.2 FIHSSUW01EWK4S.2 length=287
AACAGTGGTATCAACGCAGAGTGGCCATTACGGCCGGGAGATTACAGGTATTGCAAGTTTCAAGCCTGTC
ATAAAGACTCAAAGCCGCTTGTAATTTGTGTTTCCTAGTTGGGGAAGCTGTTTGTTCTTTATTGTGCTAT
ATGTATTTATTTGAAAGTTTGGATGAACTCAATAAATAAAAGAAAATCTTCATTGTGGGTTACAATTTGG
ACATGAACATGCATGAATAATGTACCAATTTAGCAAAAAAAAAGAAAAAAACAAAAAACAAATAGTCGGC
CGGCCCG
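
If the script is called from other scripts, it can help to signal whether the header was found. The following is only a sketch of one possible variation, not part of the answer above: the function name fasta_find, the found flag, and the sample data are assumptions made for illustration.

```shell
# Hypothetical wrapper around the awk command shown above.
# $1: exact header line without the leading ">"; $2: FASTA file.
fasta_find() {
    awk -v p="$1" 'BEGIN { ORS = ""; RS = ">"; FS = "\n" }
        $1 == p { print ">" $0; found = 1 }
        END { exit !found }' "$2"   # exit status 0 only if a record matched
}

# Hypothetical sample data.
fasta=$(mktemp)
printf '>gnl|SRA|read.1 length=4\nACGT\n>gnl|SRA|read.2 length=4\nTTGG\n' > "$fasta"

hit=$(fasta_find 'gnl|SRA|read.2 length=4' "$fasta")

miss_status=0
fasta_find 'gnl|SRA|read.9 length=4' "$fasta" > /dev/null || miss_status=$?

echo "$hit"
echo "status for a missing header: $miss_status"
rm -f "$fasta"
```

The first call prints the matching record with its ">" restored; the second prints nothing and returns a non-zero status, which a caller can test with if or ||.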