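I have a "summary.txt" file (tab-delimited, more than 300 lines) in which each line holds the information for one peptide. Column 4 is the Ensembl gene ID and column 5 is the Ensembl protein ID, so a single gene ID may be associated with several proteins.
My "summary.txt" file looks like this: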
Genus Species taxon_id gene_member_id stable_id
Homo sapiens 9606 9131292 ENSP00000426290
Gorilla gorilla 9595 9131292 ENSGGOP00000018925
....
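Because one gene ID can map to several protein IDs, it may help to group the rows by gene before writing anything. The following is only a minimal sketch: `group_proteins_by_gene` is a hypothetical helper, the column indices are 0-based assumptions that have to match the real file, and the title line is assumed to have been removed by the caller.

```perl
use strict;
use warnings;

# Hypothetical helper: groups protein IDs by gene ID.
# $lines    - array ref of tab-delimited lines (title line already removed)
# $gene_col - 0-based index of the gene ID column (assumption)
# $prot_col - 0-based index of the protein ID column (assumption)
sub group_proteins_by_gene {
    my ($lines, $gene_col, $prot_col) = @_;
    my %proteins_for;
    for my $line (@$lines) {
        chomp(my $copy = $line);
        # Skip blank lines and comment lines
        next if $copy =~ /^\s*$/ || $copy =~ /^#/;
        my @tab = split /\t/, $copy;
        # Skip rows that are too short to contain both columns
        next unless defined $tab[$gene_col] && defined $tab[$prot_col];
        push @{ $proteins_for{ $tab[$gene_col] } }, $tab[$prot_col];
    }
    return \%proteins_for;
}
```

For the two sample rows above, calling it with columns 3 and 4 would give a single key, 9131292, holding both protein IDs.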
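I have a script that runs through each line of the .txt file and writes the FASTA sequence for each Ensembl protein ID to its own file.
What I want is to write the FASTA files into the folder corresponding to the Ensembl gene ID.
Thanks in advance.
My code is as follows: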
use strict;
use warnings;
use Bio::EnsEMBL::Registry;

Bio::EnsEMBL::Registry->load_registry_from_db(-host => 'ensembldb.ensembl.org', -user => 'anonymous'); # Registry loading
# Fetch the adaptor once instead of once per protein
my $seqmember_adaptor = Bio::EnsEMBL::Registry->get_adaptor('Multi', 'compara', 'SeqMember');

my $input_file = "/path/to/my/input/file/summary.txt";
open(my $in, '<', $input_file) or die "$! $input_file\n";
while (my $line = <$in>) {
    chomp $line;
    ## skip comments and blank lines and optional repeat of title line
    next if $line =~ /^\s*$/ || $line =~ /^#/ || $line =~ /^Common_name/;
    my @tab = split /\t/, $line;
    # 0-based indices, kept from the original script; adjust them if they do not match the file's column layout
    my ($gene_id, $prot_id) = @tab[4, 5];
    unless (defined $gene_id && defined $prot_id) {
        warn "Line $.: expected at least 6 tab-separated columns, skipping\n";
        next;
    }
    s/^\s+|\s+$//g for $gene_id, $prot_id;    # trim whitespace (Perl has no built-in trim)
    print "The gene ID is: $gene_id\n";
    # Creates a folder specific to the Ensembl gene ID
    unless (-d $gene_id) {
        mkdir $gene_id or die "Can't create $gene_id: $!\n";
    }
    # Writing sequence files (these still go to the current directory, not into the gene folder)
    my $out_file = "${prot_id}_out.fa";
    open(my $prot, '>', $out_file) or die "Can't open $out_file: $!\n";    # specifying output file
    print "Protein ID:$prot_id\n";
    # fetch the member
    my $seqmember = $seqmember_adaptor->fetch_by_stable_id($prot_id);
    die "No SeqMember found for $prot_id\n" unless defined $seqmember;
    print {$prot} $seqmember->sequence(), "\n";
    close($prot);
}
close($in);
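
One possible way to get each FASTA file into the folder named after its gene ID is to build the output path from the gene ID before opening the file. The sketch below is an illustration under stated assumptions, not a tested fix for the script: `write_fasta_to_gene_dir` is a hypothetical helper, the sequence is passed in as a plain string so the example does not depend on the Ensembl API, and the `>protein_id` header line is an addition that the script above does not write.

```perl
use strict;
use warnings;
use File::Path qw(make_path);
use File::Spec;

# Hypothetical helper: writes one sequence to <base_dir>/<gene_id>/<protein_id>_out.fa
# and returns the path of the file it wrote.
sub write_fasta_to_gene_dir {
    my ($base_dir, $gene_id, $protein_id, $sequence) = @_;

    # Create the per-gene folder; make_path does nothing if it already exists.
    my $gene_dir = File::Spec->catdir($base_dir, $gene_id);
    make_path($gene_dir);

    # Build the output path inside the gene folder.
    my $out_file = File::Spec->catfile($gene_dir, "${protein_id}_out.fa");
    open(my $out, '>', $out_file) or die "Can't open $out_file: $!\n";
    # Assumption: a FASTA header line is wanted before the sequence.
    print {$out} ">$protein_id\n$sequence\n";
    close($out) or die "Can't close $out_file: $!\n";

    return $out_file;
}
```

Inside the loop it could be called with a base directory, the gene ID, the protein ID and the string returned by `$seqmember->sequence()`, replacing the separate mkdir and open steps.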