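How do I extract sequences from a FASTA file using Perl?

Time: 2015-03-16 06:24:42

Tags: regex perl file-handling

I have a FASTA file containing many protein sequences. I need to read the FASTA file, remove the headers, and save the sequences in separate variables. Any suggestions on how to do this in Perl (not BioPerl, please)?

Example of the FASTA file: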

gi|542264878|ref|XP_003460692.2| PREDICTED: myosin heavy chain, fast skeletal muscle-like, partial [Oreochromis niloticus|
KCFEKPKPAKGKAEAHFSLVHYAGTVDYNITGWLDKNKDPLNDSVVQLYQKSSNKLLALLYVAHAGGEEAGGGKKGGKKKGGSFQTVSALFRENLGKLMTNLRSTHPHFVRCLIPNETKTPGLMENFLVIHQLRCNGVLEGIRICRKGFPSRILYGDFKQRYKVLNASVIPEGQFIDNKKAS

I only want the sequence:

KCFEKPKPAKGKAEAHFSLVHYAGTVDYNITGWLDKNKDPLNDSVVQLYQKSSNKLLALLYVAHAGGEEAGGGKKGGKKKGGSFQTVSALFRENLGKLMTNLRSTHPHFVRCLIPNETKTPGLMENFLVIHQLRCNGVLEGIRICRKGFPSRILYGDFKQRYKVLNASVIPEGQFIDNKKAS

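1 answer:

Answer 0: (score: 0)

If awk is fine for you, this simple one-liner will do. It prints the last whitespace-separated field ($NF) of each line, so it assumes the header and the sequence are on the same line, as in the test file below: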

# cat test 
gi|542264878|ref|XP_003460692.2| PREDICTED: myosin heavy chain, fast skeletal muscle-like, partial [Oreochromis niloticus| KCFEKPKPAKGKAEAHFSLVHYAGTVDYNITGWLDKNKDPLNDSVVQLYQKSSNKLLALLYVAHAGGEEAGGGKKGGKKKGGSFQTVSALFRENLGKLMTNLRSTHPHFVRCLIPNETKTPGLMENFLVIHQLRCNGVLEGIRICRKGFPSRILYGDFKQRYKVLNASVIPEGQFIDNKKAS

# awk '{print $NF}' test
KCFEKPKPAKGKAEAHFSLVHYAGTVDYNITGWLDKNKDPLNDSVVQLYQKSSNKLLALLYVAHAGGEEAGGGKKGGKKKGGSFQTVSALFRENLGKLMTNLRSTHPHFVRCLIPNETKTPGLMENFLVIHQLRCNGVLEGIRICRKGFPSRILYGDFKQRYKVLNASVIPEGQFIDNKKAS

Here is the Perl way. The -a switch splits each line into @F, and $F[-1] is the last field:

# perl -lane 'print $F[-1]' test 
KCFEKPKPAKGKAEAHFSLVHYAGTVDYNITGWLDKNKDPLNDSVVQLYQKSSNKLLALLYVAHAGGEEAGGGKKGGKKKGGSFQTVSALFRENLGKLMTNLRSTHPHFVRCLIPNETKTPGLMENFLVIHQLRCNGVLEGIRICRKGFPSRILYGDFKQRYKVLNASVIPEGQFIDNKKAS

See this link for an explanation of each one-liner: https://blogs.oracle.com/ksplice/entry/the_top_10_tricks_of