Writing protein sequences into Ensembl gene ID folders

Time: 2015-09-02 10:16:04

Tags: perl fasta

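I have a "summary.txt" file (a tab-delimited file with more than 300 lines) in which each line holds the information for one peptide. Tab4 is the Ensembl gene ID and Tab5 is the Ensembl protein ID, so a single gene ID may have several proteins.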

My "summary.txt" file looks like this:

Genus Species taxon_id gene_member_id stable_id

Homo sapiens 9606 9131292 ENSP00000426290
Gorilla gorilla 9595 9131292 ENSGGOP00000018925
....
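Because several rows can share one gene ID, the relationship can be pictured as a hash that maps each gene ID to a list of protein IDs. The following is only a minimal sketch: `group_by_gene` is a hypothetical helper, and the column positions (3 and 4, counted from zero) are an assumption based on the five columns shown in the excerpt above, so they may need adjusting for the real file.

```perl
use strict;
use warnings;

# Hypothetical helper: group protein IDs by gene ID (hash of arrays).
# The column positions are arguments because they are an assumption
# about the layout of summary.txt.
sub group_by_gene {
    my ($lines, $gene_col, $prot_col) = @_;
    my %proteins_for;
    for my $line (@$lines) {
        chomp(my $row = $line);
        next if $row =~ /^\s*$/ || $row =~ /^#/;    # skip blank lines and comments
        my @tab = split /\t/, $row;
        next unless defined $tab[$gene_col] && defined $tab[$prot_col];
        push @{ $proteins_for{ $tab[$gene_col] } }, $tab[$prot_col];
    }
    return \%proteins_for;
}

# The two sample rows from the excerpt, with tabs restored
my @rows = (
    "Homo\tsapiens\t9606\t9131292\tENSP00000426290\n",
    "Gorilla\tgorilla\t9595\t9131292\tENSGGOP00000018925\n",
);

my $groups = group_by_gene(\@rows, 3, 4);
for my $gene (sort keys %$groups) {
    print "$gene: @{ $groups->{$gene} }\n";
}
```

With the two sample rows, both protein IDs end up under the single gene ID 9131292.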

I have a script that runs through each line of the .txt file and writes the FASTA sequence for each Ensembl protein ID into its own file.

What I want is for each FASTA file to be written into the folder that corresponds to its Ensembl gene ID.

Thanks in advance.

My code is as follows:

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

# Registry loading
Bio::EnsEMBL::Registry->load_registry_from_db(-host => 'ensembldb.ensembl.org', -user => 'anonymous');

my $input_file = "/path/to/my/input/file/summary.txt";

open(my $in, '<', $input_file) || die "$! $input_file\n";

# The adaptor is the same for every row, so fetch it once
my $seqmember_adaptor = Bio::EnsEMBL::Registry->get_adaptor('Multi', 'compara', 'SeqMember');

while (my $line = <$in>) {
    chomp $line;

    ## skip comments and blank lines and optional repeat of title line
    next if $line =~ /^\s*$/ || $line =~ /^\#/ || $line =~ /^Common_name/;

    my @tab = split("\t", $line);

    my $gene = $tab[4];    # Ensembl gene ID
    my $ID   = $tab[5];    # Ensembl protein ID
    s/^\s+|\s+$//g for ($gene, $ID);    # trim surrounding whitespace

    print "The gene ID is: $gene\n";
    mkdir $gene unless -d $gene;    # Creates folders specific for ensembl_gene_id

    # Writing sequence files
    # Output file; as written, it is created in the current directory, not inside $gene
    open(my $prot, '>', "${ID}_out.fa") || die "Can't open ${ID}_out.fa\n";
    print "Protein ID:$ID\n";

    # fetch the member
    my $seqmember = $seqmember_adaptor->fetch_by_stable_id($ID);
    print {$prot} $seqmember->sequence(), "\n";
    close $prot;
}

close $in;
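
One way to get each file into its gene folder is to build the output path from the folder name before opening the file, instead of opening a bare file name. The following is a minimal sketch under stated assumptions, not a tested solution against the Ensembl API: `write_fasta_in_gene_dir` is a hypothetical helper, the sequence `MKT` is a made-up placeholder, and the `>` header line is an addition that the script above does not write. In the real script the sequence argument would come from `$seqmember->sequence()`.

```perl
use strict;
use warnings;
use File::Path qw(make_path);
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: write one protein sequence to
# <base_dir>/<gene_id>/<protein_id>_out.fa and return that path.
sub write_fasta_in_gene_dir {
    my ($base_dir, $gene_id, $prot_id, $sequence) = @_;

    my $gene_dir = File::Spec->catdir($base_dir, $gene_id);
    make_path($gene_dir);    # no error if the folder already exists

    my $out_file = File::Spec->catfile($gene_dir, "${prot_id}_out.fa");
    open(my $out, '>', $out_file) or die "Can't open $out_file: $!\n";
    print {$out} ">$prot_id\n$sequence\n";    # FASTA header line, then the sequence
    close $out or die "Can't close $out_file: $!\n";

    return $out_file;
}

# Demo in a temporary directory with a placeholder sequence (not real data)
my $base = tempdir(CLEANUP => 1);
my $path = write_fasta_in_gene_dir($base, '9131292', 'ENSP00000426290', 'MKT');
print "Wrote $path\n";
```

Building the path with `File::Spec` keeps the script from having to `chdir` into each folder, and `make_path` can be called for every row because it does nothing when the folder is already there.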

0 answers:

No answers yet