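I have two txt files, and I read them into hashes of sequence => title. In the file DUOMENYS.txt the titles are known; in the file DUOTA.txt the titles are unknown. So for every sequence in DUOTA.txt I need to find a similar sequence in DUOMENYS.txt and then print that known title. I tried doing this with simple matching, printing the title when more than 90% of the sequence symbols match, but I was told this is wrong and that I have to do it another way, using this table: http://www.ncbi.nlm.nih.gov/Class/FieldGuide/BLOSUM62.txt I have to compare the letters of the known and unknown sequences position by position and look up a number for each pair (-4 to 9); if the sum of all the numbers >= sequence length * 3, print the title.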
Example: ABW => TITLE1 in DUOMENYS.txt, ABF => UNKNOWNTITLE in DUOTA.txt:
A B W
A B F
4 4 1 sum = 9
length 3 x 3 = 9
9 >= 9, true, print.
So the problem is that I don't know how to implement it...
#!/usr/bin/perl
use strict;
use Data::Dumper;
open (OUTPUT, ">OUTPUT.txt") or die "$!"; # Known sequences
open (DUOMENYS, "DUOMENYS.txt") or die "$!";
open (OUTPUT1, ">OUTPUT1.txt") or die "$!"; # Sequences under study
open (DUOTA, "DUOTA.txt") or die "$!";
open (OUTPUT2, ">OUTPUT2.txt") or die "$!"; # Results
open (MATRIX, "MATRIX.txt") or die "$!";
#--------------------DUOMENYS-HASH-----------------------------
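my %testHash = ();
my $currentKey = "";
my $seka = "";
my %nhash = ();
# Read straight from the DUOMENYS handle; reading one line into a variable
# first would silently drop the first title line of the file.
while(my $contentLine = <DUOMENYS>){
    chomp($contentLine);
    next if($contentLine eq ""); # Empty lines.
    if($contentLine =~ /^\>(.*)/){
        # Store the previous record before starting a new one.
        $testHash{$currentKey} = $seka if $currentKey ne "";
        $currentKey= $1;
        $seka = "";
    }else{
        $seka .= $contentLine;
    }
}
# Store the last record; no later title line triggers it.
$testHash{$currentKey} = $seka if $currentKey ne "";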
#-------------------DUOTA-HASH-------------------------------------
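my %testHash1 = ();
my $currentKey1 = "";
my $seka1 = "";
my %nhash1 = ();
# Read straight from the DUOTA handle; reading one line into a variable
# first would silently drop the first title line of the file.
while(my $contentLine1 = <DUOTA>){
    chomp($contentLine1);
    next if($contentLine1 eq ""); # Empty lines.
    if($contentLine1 =~ /^\>(.*)/){
        # Store the previous record before starting a new one.
        $testHash1{$currentKey1} = $seka1 if $currentKey1 ne "";
        $currentKey1= $1;
        $seka1 = "";
    }else{
        $seka1 .= $contentLine1;
    }
}
# Store the last record; no later title line triggers it.
$testHash1{$currentKey1} = $seka1 if $currentKey1 ne "";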
#--------------------OUTPUT-HASH------------------------------------
%nhash = reverse %testHash;
print OUTPUT Dumper(\%nhash);
%nhash1 = reverse %testHash1;
print OUTPUT1 Dumper(\%nhash1);
#---------------------MATCHING---------------------------------------
my $klaidu_skaicius = 0;
my @sekos = keys %nhash;
my @duotos_sekos = keys %nhash1;
# Iterate over valid indices only ("<=" against the array size ran one past the end).
for my $i (0 .. $#sekos){
    for my $j (0 .. $#duotos_sekos){
        # XOR leaves \0 wherever the two strings agree; count the other positions.
        $klaidu_skaicius = ($sekos[$i] ^ $duotos_sekos[$j]) =~ tr/\0//c;
        if($klaidu_skaicius <= length($sekos[$i])/10){
            print OUTPUT2 substr( $nhash{$sekos[$i]}, 0, 9 ), "\n";
        }
    }
}
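pastebin.com/7QnBDTDY - povilito May 30 at 11:57
Too large for pastebin.com (15 MB) - povilito May 30 at 12:01
filedropper.com/duomenys - povilito May 30 at 12:04
I think comparing a "letter" with " " (a space) or "" should give us the number -4 - povilito May 30 at 12:28
It is necessary to find one title for each unknown sequence. - povilito May 30 at 12:45
So if there are 50 unknown sequences, the output file should give us 50 titles; some titles can be the same :)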
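The rule suggested in the comments above (a letter compared against a missing position scores -4) could be sketched roughly as follows. This is only a sketch under assumptions: `score_padded` and `%demo` are hypothetical names, `%demo` holds just the few BLOSUM62 entries needed for the example rather than the whole table, and the -4 score for a missing position is the asker's own guess from the comment, not a confirmed requirement.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical subset of BLOSUM62: only the pairs needed for this example.
my %demo = (
    A => { A => 4 },
    B => { B => 4 },
    W => { W => 11, F => 1 },
    F => { F => 6,  W => 1 },
);

# Score two sequences position by position. Any position where one of the
# sequences has already ended scores $gap (-4, as assumed in the comment above).
sub score_padded {
    my ($matrix, $known, $unknown, $gap) = @_;
    my ($len1, $len2) = (length $known, length $unknown);
    my $max = $len1 > $len2 ? $len1 : $len2;
    my $sum = 0;
    for my $i (0 .. $max - 1) {
        if ($i < $len1 && $i < $len2) {
            my $x = substr($known,   $i, 1);
            my $y = substr($unknown, $i, 1);
            # Pairs missing from the matrix are scored like a gap (an assumption).
            $sum += $matrix->{$x}{$y} // $gap;
        } else {
            $sum += $gap;
        }
    }
    return $sum;
}

print score_padded(\%demo, "ABW", "ABF", -4), "\n";   # 4 + 4 + 1 = 9
print score_padded(\%demo, "ABW", "AB",  -4), "\n";   # 4 + 4 - 4 = 4
```

Padding only at the end is the simplest reading of that comment; it does not try to find the best alignment between sequences of different length.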
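Answer 0 (score: 0)
As I understand it, below is a basic version of a solution to your problem. I assume both sequences have the same length; you can easily change that if needed.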
#!/usr/bin/perl
use strict;
use warnings;
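# convert your matrix file to a 2D hash
my %hash;
open my $fh, '<', 'matrix' or die "unable to open file: $!\n";
# skip leading '#' comment lines (the NCBI BLOSUM62.txt file starts with several)
my $heading;
while (defined($heading = <$fh>)) {
    last if $heading !~ /^#/;
}
die "no heading row found in matrix file\n" unless defined $heading;
chomp($heading);
# strip leading whitespace from the heading row
$heading =~ s/^\s+//;
my @headings = split (/\s+/, $heading);
while(<$fh>) {
    chomp;
    my @line = split;
    # the first field of each row is the row letter
    my $key1 = shift @line;
    foreach my $i( 0 .. $#line) {
        $hash{$key1}{$headings[$i]} = $line[$i];
    }
}
close $fh;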
# Taken from your example for reference
my $duameny = "ABW";
my $duota = "ABF";
my @duamenys = split (//,$duameny);
my @duotas = split (//,$duota);
# calculate the sum from the hash,
# assuming both sequences have the same length
my $sum = 0;
foreach my $i (0 .. $#duamenys) {
    $sum += $hash{$duamenys[$i]}{$duotas[$i]};
}
# calculate the threshold: sequence length * 3
my $length = (length $duameny) * 3;
print "SUM: $sum, Length: $length\n";
if($sum >= $length) {
    # probably you know how to print the title
    # print the title from duamenys.txt
}
Below is a summary of my approach.
sum >= length
Output:
SUM: 9, Length: 9
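To go from the single-pair check above to one title per unknown sequence, something along these lines might work. It is a minimal sketch under assumptions: `pair_score` and `best_title` are hypothetical helpers, `%demo` is a tiny stand-in for the full `%hash` built from the matrix file, `%known` and `%unknown` stand in for the sequence => title hashes from the question, sequences of different length are simply skipped, and picking the highest-scoring match is my own choice for the case where several known sequences pass the threshold.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Tiny stand-in for the %hash built from the matrix file.
my %demo = (
    A => { A => 4 },
    B => { B => 4 },
    W => { W => 11, F => 1 },
    F => { F => 6,  W => 1 },
);

# Sum the matrix scores of two equal-length sequences.
# Returns undef if the lengths differ or a pair is missing from the matrix.
sub pair_score {
    my ($matrix, $s1, $s2) = @_;
    return if length($s1) != length($s2);
    my $sum = 0;
    for my $i (0 .. length($s1) - 1) {
        my $score = $matrix->{ substr($s1, $i, 1) }{ substr($s2, $i, 1) };
        return unless defined $score;
        $sum += $score;
    }
    return $sum;
}

# Return the title of the best-scoring known sequence whose score reaches
# 3 * length of the query, or undef if none does.
sub best_title {
    my ($matrix, $known, $query) = @_;   # $known: sequence => title
    my ($best, $title);
    for my $seq (sort keys %$known) {
        my $score = pair_score($matrix, $seq, $query);
        next unless defined $score && $score >= 3 * length($query);
        ($best, $title) = ($score, $known->{$seq})
            if !defined $best || $score > $best;
    }
    return $title;
}

my %known   = (ABW => 'TITLE1', AAA => 'TITLE2');
my %unknown = (ABF => 'UNKNOWNTITLE');
for my $query (sort keys %unknown) {
    my $title = best_title(\%demo, \%known, $query);
    print defined $title ? "$title\n" : "no match for $query\n";
}
```

With the example data this prints TITLE1 for the query ABF. On a 15 MB input, comparing every unknown sequence against every known one may be slow, so grouping the known sequences by length first could be worth considering.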