My program takes this input:
miRNA127 dvex589433 131 154 - 24 87.5 atcgtaacgtatctcccacactta 32 55 98
miRNA32 dvex320240 61 83 - 23 86.9565217391304 cttctacaatggtactgtccatt 31 53 97
miRNA32 dvex623745 141 163 - 23 86.9565217391304 ggtttcttccacaatagtaattt 26 48 97
miRNA79 dvex468096 702 733 - 32 81.25 ttggttaaaaatttttttttttaattaaaaaa 6 37 55
miRNA79 dvex468096 717 743 + 27 81.4814814814815 aaaaaatttttaaccaaagaaaaaaat 13 39 55
miRNA79 dvex468096 694 718 - 25 84 tttttttaattaaaaaacaattttt 17 41 55
miRNA79 dvex468096 696 724 + 29 75.8620689655172 aaattgttttttaattaaaaaaaaaaatt 13 41 55
miRNA79 dvex219016 1103 1130 + 28 78.5714285714286 aaatttttgctaaaaaatacaaaaattt 14 41 55
miRNA79 dvex219016 3420 3446 + 27 77.7777777777778 aaaatattattaaataaataatgcaat 13 39 55
miRNA79 dvex219016 1384 1408 + 25 80 tttcgtgaaacaaaaaagtttggaa 21 45 55
miRNA79 dvex219016 4384 4424 + 25 80 tttcgtgaaacaaaaaagtttggaa 21 45 55
miRNA154 dvex573491 297 324 + 28 78.5714285714286 cagcttgattttaagcctatctgaaagc 23 50 76
miRNA154 dvex546562 232 259 + 28 78.5714285714286 cagcttgattttaagcctatttgaaagc 23 50 76
miRNA154 dvex648254 147 172 + 26 80.7692307692308 aagcctacggagtgcgaggcagagct 47 72 76
miRNA154 dvex648254 277 303 + 26 80.7692307692308 aagcctacggagtgcgaggcagagct 47 72 76
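I need to group the rows that share the same values in columns $1, $2 and $5. To do that, I decided to use a hash whose values are arrays of nested arrays: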
$VAR1 = {
'miRNA79 dvex219016 +' => [
[ '1103', '1130', '14', '41', '55' ],
[ '3420', '3446', '13', '39', '55' ],
[ '1384', '1408', '21', '45', '55' ],
[ '4384', '4424', '21', '45', '55' ]
],
'miRNA79 dvex468096 +' => [
[ '717', '743', '13', '39', '55' ],
[ '696', '724', '13', '41', '55' ]
],
'miRNA154 dvex546562 +' => [ [ '232', '259', '23', '50', '76' ] ],
'miRNA79 dvex468096 -' => [
[ '702', '733', '6', '37', '55' ],
[ '694', '718', '17', '41', '55' ]
],
'miRNA154 dvex648254 +' => [
[ '147', '172', '47', '72', '76' ],
[ '277', '303', '47', '72', '76' ]
],
'miRNA127 dvex589433 -' => [ [ '131', '154', '32', '55', '98' ] ],
'miRNA154 dvex573491 +' => [ [ '297', '324', '23', '50', '76' ] ],
'miRNA32 dvex320240 -' => [ [ '61', '83', '31', '53', '97' ] ],
'miRNA32 dvex623745 -' => [ [ '141', '163', '26', '48', '97' ] ]
};
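After that, for each hash key I sort the nested arrays by their first element ([0]->[0]). If a key has only one nested array, I print it. If it has more than one, I need to merge them. Here is an example of one group: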
'miRNA79 dvex468096 +' => [
[ '717', '743', '13', '39', '55' ],
[ '696', '724', '13', '41', '55' ]
],
After sorting it:
$VAR1 = [ [ 696, '724', '13', '41', '55' ],
[ 717, '743', '13', '39', '55' ] ];
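If the difference between [1][1] and [0][0] is less than or equal to [0][4] (here 743 - 696 = 47, which is within 55), I need to merge the two rows and produce this new array: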
$VAR1 = [ [ 696, '743', '13', '39', '55' ], ];
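To make the condition concrete, here is a minimal sketch of that check on the two sorted rows. The helper name merge_pair is hypothetical and is not part of my program; it assumes each row is laid out as [start, end, start_q, end_q, limit].

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: merge two rows [start, end, start_q, end_q, limit]
# when the span from the first start to the second end fits within the limit.
# Returns the merged row, or nothing if the rows are too far apart.
sub merge_pair {
    my ( $first, $second ) = @_;
    return unless $second->[1] - $first->[0] <= $first->[4];
    return [ $first->[0], $second->[1], $first->[2], $second->[3], $first->[4] ];
}

# Sort by start position, as described above.
my @sorted = sort { $a->[0] <=> $b->[0] }
    ( [ 717, 743, 13, 39, 55 ], [ 696, 724, 13, 41, 55 ] );

if ( my $merged = merge_pair(@sorted) ) {
    print join( "\t", @$merged ), "\n";    # 743 - 696 = 47 <= 55, so the rows merge
}
```

For these two rows the sketch prints 696, 743, 13, 39 and 55, separated by tabs.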
I then print the merged array. In this other case:
$VAR1 = [
[ 1103, '1130', '14', '41', '55' ],
[ 1384, '1408', '21', '45', '55' ],
[ 3420, '3446', '13', '39', '55' ],
[ 4384, '4424', '21', '45', '55' ]
];
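Evaluating whether [1][1] - [0][0] is less than or equal to [0][4] gives FALSE (1408 - 1103 = 305, which exceeds 55), so I need to extract the first nested array and print it, then iterate again and evaluate the same condition on what remains. Whenever the condition is TRUE I need to merge; whenever it is FALSE I need to extract the first nested array and print it. Here is my code: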
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use List::Util qw/ min max /;
use List::Util qw(sum);
use Math::MatrixReal;
my %data;
my $val;
my $num;
my $start;
my $end;
my $diff;
my $start_q;
my $end_q;
my @new_data;
my @extract;
my @extract2;
my $limit;
while (<>) {
chomp;
my @fields = split;
push @{ $data{"@fields[0,1,4]"} }, [ @fields[ 2, 3, 8, 9, 10 ] ];
}
foreach my $key ( sort keys %data ) {
$val = $data{$key};
$num = scalar @$val;
next if $num == 0;
if ( $num == 1 ) { # print if the hash have 1 nested array
print
"$key\t $data{$key}[0][0]\t $data{$key}[0][1]\t $data{$key}[0][2]\t $data{$key}[0][3]\t $data{$key}[0][4]\n";
}
else {
foreach my $keys ( @$val[0] ) {
my @sorted = sort { $a->[0] <=> $b->[0] }
@$val; #organize the nested array values
$start = $sorted[0][0];
$end = $sorted[1][1];
$limit = $sorted[0][4];
$diff = $end - $start;
$start_q = $sorted[0][2];
$end_q = $sorted[1][3];
if ( $diff < $limit ) {
@new_data = ();
push( @new_data, $start );
push( @new_data, $end );
push( @new_data, $start_q );
push( @new_data, $end_q );
push( @new_data, $limit );
@extract = splice( @{ $sorted[0] }, 0, 5, @new_data );
@extract2 = splice( @{ $sorted[1] } );
}
else {
my @toprint = splice( @{ $sorted[0] } );
print
"$key\t$toprint[0]\t$toprint[1]\t$toprint[2]\t$toprint[3]\t$toprint[4]\n";
}
}
}
}
Overall, I get this result:
miRNA127 dvex589433 - 131 154 32 55 98
miRNA154 dvex546562 + 232 259 23 50 76
miRNA154 dvex573491 + 297 324 23 50 76
miRNA154 dvex648254 + 147 172 47 72 76
miRNA32 dvex320240 - 61 83 31 53 97
miRNA32 dvex623745 - 141 163 26 48 97
miRNA79 dvex219016 + 1103 1130 14 41 55
But some rows are missing from this list, because my code does not keep iterating when the condition is TRUE. Any suggestions?
Answer 0 (score: 0)
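I'm not sure, but I think you are trying to merge RNA sequences (?) into one when they are close enough, that is, when the resulting length is less than some limit. You may be looking for code like this: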
#!/usr/bin/perl
use strict;
use warnings;
# Input data format positions
use constant KEY_FIELDS => ( 0, 1, 4 );
use constant DATA_FIELDS => ( 2, 3, 8, 9, 10 );
# Entry positions (DATA_FIELDS meanings)
use constant {
START_P => 0,
END_P => 1,
START_Q => 2,
END_Q => 3,
LIMIT => 4
};
# Output formatter
use constant TO_PRINT => START_P .. LIMIT;
sub format_entry {
my ( $key, $data ) = @_;
join "\t", $key, @$data[TO_PRINT];
}
# Read Data
my %data;
while (<>) {
chomp;
my @fields = split;
push @{ $data{"@fields[KEY_FIELDS]"} }, [ @fields[DATA_FIELDS] ];
}
# Transform data to keep only records supposed to appear in output
for my $value ( values %data ) {
my @entries = sort { $a->[START_P] <=> $b->[START_P] } @$value;
my @result = ( shift @entries ); # add first one as reference
while (@entries) {
my $ref = $result[-1]; # reference entry
my $entry = shift @entries;
if ( $entry->[END_P] - $ref->[START_P] < $ref->[LIMIT] ) {
# merge entry into reference
@$ref[ END_P, END_Q ] = @$entry[ END_P, END_Q ];
}
else {
push @result, $entry;
}
}
$value = \@result; # rewrite value in %data hash
}
# Write output
for my $key ( sort keys %data ) {
print format_entry( $key, $_ ), "\n" for @{ $data{$key} };
}
The result for the data in your question is:
miRNA127 dvex589433 - 131 154 32 55 98
miRNA154 dvex546562 + 232 259 23 50 76
miRNA154 dvex573491 + 297 324 23 50 76
miRNA154 dvex648254 + 147 172 47 72 76
miRNA154 dvex648254 + 277 303 47 72 76
miRNA32 dvex320240 - 61 83 31 53 97
miRNA32 dvex623745 - 141 163 26 48 97
miRNA79 dvex219016 + 1103 1130 14 41 55
miRNA79 dvex219016 + 1384 1408 21 45 55
miRNA79 dvex219016 + 3420 3446 13 39 55
miRNA79 dvex219016 + 4384 4424 21 45 55
miRNA79 dvex468096 + 696 743 13 39 55
miRNA79 dvex468096 - 694 733 17 37 55
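One detail to check: the question's prose says "less than or equal to", while the code above compares with `<`. For this data the two give the same output. As a rough sketch only, the same merge loop can be written as a standalone subroutine that uses `<=` and works on copies of the rows; the name merge_rows is hypothetical, and the row layout [start, end, start_q, end_q, limit] is assumed from the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: takes rows of [start, end, start_q, end_q, limit]
# and returns a new list in which each row close enough to the current
# reference row is folded into it. Input rows are copied, not modified.
sub merge_rows {
    my @rows = sort { $a->[0] <=> $b->[0] } map { [@$_] } @_;
    return unless @rows;
    my @result = ( shift @rows );    # first row is the initial reference
    for my $row (@rows) {
        my $ref = $result[-1];
        if ( $row->[1] - $ref->[0] <= $ref->[4] ) {    # "<=" per the question's wording
            @$ref[ 1, 3 ] = @$row[ 1, 3 ];             # extend end and end_q
        }
        else {
            push @result, $row;                        # too far: becomes the new reference
        }
    }
    return @result;
}

my @group = (
    [ 702, 733, 6,  37, 55 ],
    [ 694, 718, 17, 41, 55 ],
);
print join( "\t", @$_ ), "\n" for merge_rows(@group);
```

For the 'miRNA79 dvex468096 -' group this prints a single row, 694 733 17 37 55, which matches the last line of the output above.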