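Searching a fasta file with AWK, given a second file containing the sequence names

Time: 2016-07-20 21:45:24

Tags: awk fasta

I have 2 files. One is a fasta file containing multiple fasta sequences, and the other contains the names of the candidate sequences I want to search for (file examples below).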

seq.fasta

>Clone_18
GTTACGGGGGACACATTTTCCCTTCCAATGCTGCTTTCAGTGATAAATTGAGCATGATGGATGCTGATAATATCATTCCCGTGT
>Clone_23
GTTACGGGGGGCCGAAAAACACCCAATCTCTCTCTCGCTGAAACCCTACCTGTAATTTGCCTCCGATAGCCTTCCCCGGTGA
>Clone_27-1
GTTACGGGGACCACACCCTCACACATACAAACACAAACACTTCAAGTGACTTAGTGTGTTTCAGCAAAACATGGCTTC
>Clone_27-2
GTTACGGGGACCACACCCTCACACATACAAACACAAACACTTCAAGTGACTTAGTGTGTTTCAGCAAAACATGGCTTCGTTTTGTTCTAGATTAACTATCAGTTTGGTTCTGTTTGTCCTCGTACTGGGTTGTGTCAATGCACAACTT
>Clone_34-1
GTTACGGGGGAATAACAAAACTCACCAACTAACAACTAACTACTACTTCACTTTTCAACTACTTTACTACAATACTAAGAATGAAAACCATTCTCCTCATTATCTTTGCTCTCGCTCTTTTCACAAGAGCTCAAGTCCCTGGCTACCAAGCCATCG
>Clone_34-3
GTTACGGGGGAATAACAAAACTCACCAACTAACAACTAACTACTACTTCACTTTTCAACTACTTTACTACAATACTAAGAATGAAAACCATTCTCCTCATTATCTTTGCTCTCGCTCTTTTCACAAGAGCTCAAGTCCCTGGCTACCAAGCCATCGATATCGCTGAAGCCCAATC
>Clone_44-1
GTTACGGGGGAATCCGAATTCACAGATTCAATTACACCCTAAAATCTATCTTCTCTACTTTCCCTCTCTCCATTCTCTCTCACACACTGTCACACACATCC
>Clone_44-3
GTTACGGGGGAATCCGAATTCACAGATTCAATTACACCCTAAAATCTATCTTCTCTACTTTCCCTCTCTCCATTCTCTCTCACACACTGTCACACACATCCCGGCAGCGCAGCCGTCGTCTCTACCCTTCACCAGGAATAAGTTTATTTTTCTACTTAC

name.txt

Clone_23
Clone_27-1

I want to use AWK to search the fasta file and get every fasta sequence whose name is listed in the other file.

awk 'NR==FNR{a[$1]=$1} BEGIN{RS="\n>"; FS="\n"} NR>FNR {if (match($1,">")) {sub(">","",$1)} for (p in a) {if ($1==p) print ">"$0}}' name.txt seq.fasta

The problem is that I can only extract the sequence of the first candidate in name.txt, like this:

>Clone_23
GTTACGGGGGGCCGAAAAACACCCAATCTCTCTCTCGCTGAAACCCTACCTGTAATTTGCCTCCGATAGCCTTCCCCGGTGA

Can anyone help fix the one-line awk command above?

2 Answers:

Answer 0: (score: 2)

If it is acceptable, or even desirable, to print the names as well, you can simply use grep:

grep -Ff name.txt -A1 a.fasta

  • -f name.txt picks the patterns from name.txt
  • -F treats them as literal strings rather than regular expressions
  • -A1 prints each matching line plus the line that follows it

If the names are not wanted in the output, I would simply pipe to another grep:

above_command | grep -v '>'

An awk solution could look like this:

awk 'NR==FNR{n[$0];next} substr($0,2) in n && getline' name.txt a.fasta

While name.txt is being read (NR==FNR), each name is stored as a key of the array n. For the fasta file, substr($0,2) strips the leading > from the line; if the remainder is a key of n, getline loads the following line and awk prints it, so only the sequence line is output.
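Laid out over several lines with comments, the awk command above might read as follows. This is only a sketch: the scratch directory, the `result` variable, and the shortened placeholder sequences are assumptions made for illustration, not the real data from the question, and it assumes each sequence sits on a single line.

```shell
# Scratch directory so that no existing files are touched (illustrative only).
tmp=$(mktemp -d)

# Placeholder input: names taken from the question, sequences shortened.
printf '%s\n' '>Clone_18'   'GTTACGGGGGACAC' \
              '>Clone_23'   'GTTACGGGGGGCCG' \
              '>Clone_27-1' 'GTTACGGGGACCAC' \
              '>Clone_27-2' 'GTTACGGGGACCACACCC' > "$tmp/a.fasta"
printf '%s\n' 'Clone_23' 'Clone_27-1' > "$tmp/name.txt"

result=$(awk '
    # NR==FNR is true only while the first file (name.txt) is being read.
    NR == FNR {
        n[$0]   # merely referencing n[$0] creates the key
        next    # do not apply the rule below to name.txt lines
    }
    # Fasta file: drop the leading ">" and check whether the rest is a
    # stored name. If it is, getline loads the next line into $0 and
    # returns 1, so the pattern is true and awk prints the sequence line.
    substr($0, 2) in n && getline
' "$tmp/name.txt" "$tmp/a.fasta")

printf '%s\n' "$result"
```

Because the header line is replaced by the sequence line before printing, this variant outputs sequences without their names.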

Answer 1: (score: 2)

$ awk 'NR==FNR{n[">"$0];next} f{print f ORS $0;f=""} $0 in n{f=$0}' name.txt seq.fasta
>Clone_23
GTTACGGGGGGCCGAAAAACACCCAATCTCTCTCTCGCTGAAACCCTACCTGTAATTTGCCTCCGATAGCCTTCCCCGGTGA
>Clone_27-1
GTTACGGGGACCACACCCTCACACATACAAACACAAACACTTCAAGTGACTTAGTGTGTTTCAGCAAAACATGGCTTC
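
For readers who want to see why this works, the same program can be spread over several lines with comments. What follows is an annotated sketch of the command above rather than anything stated in the answer; the scratch directory, the `result` variable, and the shortened placeholder sequences are assumptions for illustration, and it relies on each sequence occupying exactly one line, as in the sample.

```shell
# Scratch directory so that no existing files are touched (illustrative only).
tmp=$(mktemp -d)

# Placeholder input: names taken from the question, sequences shortened.
printf '%s\n' '>Clone_18'   'GTTACGGGGGACAC' \
              '>Clone_23'   'GTTACGGGGGGCCG' \
              '>Clone_27-1' 'GTTACGGGGACCAC' \
              '>Clone_27-2' 'GTTACGGGGACCACACCC' > "$tmp/seq.fasta"
printf '%s\n' 'Clone_23' 'Clone_27-1' > "$tmp/name.txt"

result=$(awk '
    # First file: store each name with ">" prepended, so that it can be
    # compared directly against whole header lines later on.
    NR == FNR { n[">" $0]; next }

    # f holds a matching header seen on the previous line. If it is set,
    # the current line is that header'"'"'s sequence: print both, then reset f.
    f { print f ORS $0; f = "" }

    # Remember the current line when it is one of the wanted headers.
    $0 in n { f = $0 }
' "$tmp/name.txt" "$tmp/seq.fasta")

printf '%s\n' "$result"
```

Unlike a getline-based approach, this keeps the header in a variable and prints it together with the sequence, so the names appear in the output.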