I'm back with another question. I have a list of data:
1 L DIELTQSPE H EVQLQESDAELVKPGASVKISCKASGYTFTDHE
2 L DIVLTQSPRVT H EVQLQQSGAELVKPGASIKDTY
3 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C ELDKWAN
4 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C ELDKWAG
5 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C LELDKWASL
6 L DIQMTQIPSSLSASLSIC H EVQLQQSGVEVKMSCKASGYTFTS
7 L SYELTQPPSVSVSPGSIT H QVQLVQSAKGSGYSFS P YNKRKAFYTTKNIIG
8 L SYELTQPPSVSVSPGRIT H EVQLVQSGAASGYSFS P NNTRKAFYATGDIIG
9 A MPIMGSSVAVLAIL B DIVMTQSPTVTI C EVQLQQSGRGP
10 A MPIMGSSVVLAIL B DIVMTQSPTVTI C EVQLQQSGRGP
11 L DVVMTQTPLQ H EVKLDESVTVTSSTWPSQSITCNVAHPASSTKVDKKIE
12 A DIVMTQSPDAQYYSTPYSFGQGTKLEIKR
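I want to compare the 3rd and 5th elements of each row and group the rows that share the same 3rd and 5th elements. For example, with the data above, the result would be: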
3: 3 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C ELDKWAN
4 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C ELDKWAG
5 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C LELDKWASL
9: 9 A MPIMGSSVAVLAIL B DIVMTQSPTVTI C EVQLQQSGRGP
10 A MPIMGSSVVLAIL B DIVMTQSPTVTI C EVQLQQSGRGP
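FYI, in the actual data the 3rd, 5th, and 7th elements are very long. I truncated them here so the whole row is visible.
This is what I did. I know it is very clumsy, but as a beginner I am doing the best I can. The problem is that it only shows the first group of "identical" rows. Can you tell me where it goes wrong and/or suggest other ways to solve it?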
my $file = <>;
open(IN, $file)|| die "no $file: $!\n";
my @arr;
while (my $line=<IN>){
    push @arr, [split (/\s+/, $line)] ;
}
close IN;
my (@temp1, @temp2,%hash1);
for (my $i=0;$i<=$#arr ;$i++) {
    push @temp1, [$arr[$i][2], $arr[$i][4]];
    for (my $j=$i+1;$j<=$#arr ;$j++) {
        push @temp2, [$arr[$j][2], $arr[$j][4]];
        if (($temp1[$i][0] eq $temp2[$j][0])&& ($temp1[$i][1] eq $temp2[$j][1])) {
            push @{$hash1{$arr[$i][0]}}, $arr[$i], $arr[$j];
        }
    }
}
print Dumper \%hash1;
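Answer 0 (score: 2)
You seem to be making this much more complicated than it needs to be, but that is common for beginners. Think about how you would do it by hand: look at each line and check whether its third and fifth fields are the same as those of the previous line.
The looping and everything that goes with it is completely unnecessary: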
#!/usr/bin/env perl
use strict;
use warnings;
my ($previous_row, $third, $fifth) = ('') x 3;
while (<DATA>) {
    my @fields = split;
    if ($fields[2] eq $third && $fields[4] eq $fifth) {
        print $previous_row if $previous_row;
        print "\t$_";
        $previous_row = '';
    } else {
        $previous_row = $fields[0] . "\t" . $_;
        $third = $fields[2];
        $fifth = $fields[4];
    }
}
__DATA__
1 L DIELTQSPE H EVQLQESDAELVKPGASVKISCKASGYTFTDHE
2 L DIVLTQSPRVT H EVQLQQSGAELVKPGASIKDTY
3 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C ELDKWAN
4 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C ELDKWAG
5 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C LELDKWASL
6 L DIQMTQIPSSLSASLSIC H EVQLQQSGVEVKMSCKASGYTFTS
7 L SYELTQPPSVSVSPGSIT H QVQLVQSAKGSGYSFS P YNKRKAFYTTKNIIG
8 L SYELTQPPSVSVSPGRIT H EVQLVQSGAASGYSFS P NNTRKAFYATGDIIG
9 A MPIMGSSVAVLAIL B DIVMTQSPTVTI C EVQLQQSGRGP
10 A MPIMGSSVAVLAIL B DIVMTQSPTVTI C EVQLQQSGRGP
11 L DVVMTQTPLQ H EVKLDESVTVTSSTWPSQSITCNVAHPASSTKVDKKIE
12 A DIVMTQSPDAQYYSTPYSFGQGTKLEIKR
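(Note that I changed line 10 slightly so that its 3rd field matches line 9, in order to get the groups shown in the expected output.)
Edit: Fixed a line of code that had been copied and pasted incorrectly.
Edit 2: In response to the comments, here is a second version, which does not assume that the lines to be grouped are consecutive: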
#!/usr/bin/env perl
use strict;
use warnings;
my @lines;
while (<DATA>) {
    push @lines, [ $_, split ];
}
# Sort @lines based on third and fifth fields (alphabetically), then on
# first field/line number (numerically) when third and fifth fields match
@lines = sort {
    $a->[3] cmp $b->[3] || $a->[5] cmp $b->[5] || $a->[1] <=> $b->[1]
} @lines;
my ($previous_row, $third, $fifth) = ('') x 3;
for (@lines) {
    if ($_->[3] eq $third && $_->[5] eq $fifth) {
        print $previous_row if $previous_row;
        print "\t$_->[0]";
        $previous_row = '';
    } else {
        $previous_row = $_->[1] . "\t" . $_->[0];
        $third = $_->[3];
        $fifth = $_->[5];
    }
}
__DATA__
1 L DIELTQSPE H EVQLQESDAELVKPGASVKISCKASGYTFTDHE
3 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C ELDKWAN
2 L DIVLTQSPRVT H EVQLQQSGAELVKPGASIKDTY
5 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C LELDKWASL
7 L SYELTQPPSVSVSPGSIT H QVQLVQSAKGSGYSFS P YNKRKAFYTTKNIIG
6 L DIQMTQIPSSLSASLSIC H EVQLQQSGVEVKMSCKASGYTFTS
9 A MPIMGSSVAVLAIL B DIVMTQSPTVTI C EVQLQQSGRGP
8 L SYELTQPPSVSVSPGRIT H EVQLVQSGAASGYSFS P NNTRKAFYATGDIIG
11 L DVVMTQTPLQ H EVKLDESVTVTSSTWPSQSITCNVAHPASSTKVDKKIE
10 A MPIMGSSVAVLAIL B DIVMTQSPTVTI C EVQLQQSGRGP
12 A DIVMTQSPDAQYYSTPYSFGQGTKLEIKR
4 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C ELDKWAG
Answer 1 (score: 1)
Example:
use strict;
use warnings;
use Data::Dumper; # provides Dumper(), used at the end
{ ... }
open my $fh, '<', $file or die "can't open $file: $!";
my %hash;
# read and save it
while(my $line = <$fh>){
    my @line = split /\s+/, $line;
    my $key = $line[2] . ' ' . $line[4];
    $hash{$key} ||= [];
    push @{$hash{$key}}, $line;
}
# remove single elements
for my $key (keys %hash){
    delete $hash{$key} if @{$hash{$key}} < 2;
}
print Dumper \%hash;
Answer 2 (score: 1)
A slightly different approach:
#!/usr/bin/perl
use strict;
use warnings;
my %lines; # hash with 3rd and 5th elements as key
my %first_line_per_group; # stores in which line a group appeared first
while(my $line = <>) {
    # remove line break
    chomp $line;
    # retrieve elements from line
    my @elements = split /\s+/, $line;
    # ignore invalid lines
    next if @elements < 5;
    # build key from elements 3 and 5 (array 0-based!)
    my $key = $elements[2] . " " . $elements[4];
    # remember the line number only the first time this key is seen
    if(! $lines{$key}) {
        $first_line_per_group{$key} = $elements[0];
    }
    push @{ $lines{$key} }, $line;
}
# output
for my $key (keys %lines) {
    print $first_line_per_group{$key} . ":\n";
    print " $_\n" for @{ $lines{$key} };
}
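Answer 3 (score: 0)
Your approach shows a pretty solid grasp of Perl idioms and has its merits, but it is still not how I would do it.
I think you would have an easier time with this if you structured your data a little differently: make %hash1 look something like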
(
    'ALQLTQSPSSLSAS' => {
        'RITLKESGPPLVKPTCS' => [3, 4, 5],
        'ABCXYZ' => [93, 95, 96],
    },
    'MPIMGSSVAVLAIL' => {
        'DIVMTQSPTVTI' => [9, 10],
    },
)
I added an entry, ABCXYZ, that is not in your example, to show the data structure more completely.
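One way such a nested hash could be built is sketched below. This is only an illustration under stated assumptions: the helper name group_by_third_and_fifth is made up for this example, rows are assumed to be whitespace-separated strings, and the sample rows are the truncated ones from the question.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not from the question): builds a two-level hash
# mapping 3rd field => 5th field => list of line numbers (1st field).
sub group_by_third_and_fifth {
    my @rows = @_;
    my %groups;
    for my $row (@rows) {
        my @fields = split ' ', $row;
        next if @fields < 5;    # skip rows that have no 5th field
        push @{ $groups{ $fields[2] }{ $fields[4] } }, $fields[0];
    }
    # Drop combinations that occur only once, then any empty outer entries.
    for my $third (keys %groups) {
        for my $fifth (keys %{ $groups{$third} }) {
            delete $groups{$third}{$fifth}
                if @{ $groups{$third}{$fifth} } < 2;
        }
        delete $groups{$third} unless %{ $groups{$third} };
    }
    return %groups;
}

my %hash1 = group_by_third_and_fifth(
    '3 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C ELDKWAN',
    '4 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C ELDKWAG',
    '9 A MPIMGSSVAVLAIL B DIVMTQSPTVTI C EVQLQQSGRGP',
    '12 A DIVMTQSPDAQYYSTPYSFGQGTKLEIKR',
);

# Print each surviving combination with its line numbers.
for my $third (sort keys %hash1) {
    for my $fifth (sort keys %{ $hash1{$third} }) {
        print "$third / $fifth: @{ $hash1{$third}{$fifth} }\n";
    }
}
```

With these four sample rows, only the ALQLTQSPSSLSAS / RITLKESGPPLVKPTCS combination survives (lines 3 and 4). Row 9 has no partner here and row 12 has too few fields. Nesting the hash avoids having to pick a separator for a combined key.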
Answer 4 (score: 0)
You should use the 3-argument form of open(), and you can simplify reading the data:
open my $fh, '<', $file
    or die "Cannot open '$file': $!\n";
chomp(my @rows = <$fh>);
@rows = map {[split]} @rows;
close $fh;
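To group the rows, you can use a hash whose keys are the concatenation of the 3rd and 5th fields. Edit: You have to add a separator character to rule out invalid results "if different rows produce the same concatenation" (Qtax). Additional data (for example, the line number of each data row) can be stored as the hash value. Here, the fields of each row are stored: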
my %groups;
for (@rows) {
    # a 5th field only exists when the row has at least 5 elements
    push @{ $groups{$_->[2] . ' ' . $_->[4]} }, $_
        if @$_ >= 5;
}
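To see why the separator matters, here is a small sketch with two made-up rows (they are not from the question's data) whose 3rd and 5th fields differ but concatenate to the same string:

```perl
use strict;
use warnings;

# Two made-up rows: fields 3 and 5 are ('AB', 'C') and ('A', 'BC').
my @rows = ( [ 1, 'L', 'AB', 'H', 'C' ], [ 2, 'L', 'A', 'H', 'BC' ] );

my ( %without_sep, %with_sep );
for my $row (@rows) {
    # Plain concatenation: both rows produce the key 'ABC'.
    push @{ $without_sep{ $row->[2] . $row->[4] } }, $row->[0];
    # With a space in between: the keys are 'AB C' and 'A BC'.
    push @{ $with_sep{ $row->[2] . ' ' . $row->[4] } }, $row->[0];
}

printf "without separator: %d key(s)\n", scalar keys %without_sep;
printf "with separator:    %d key(s)\n", scalar keys %with_sep;
```

A space is a safe separator here because the fields were produced by splitting on whitespace, so no field can contain one.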
Remove the groups that contain only a single row:
@{ $groups{$_} } < 2 && delete $groups{$_}
    for keys %groups;
Regards, Matthias
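The remaining groups could then be printed in roughly the layout the question asks for. The following is only a sketch: the %groups contents are a hand-written stand-in for what the code above would produce, with row 10 given the same 3rd field as row 9 so that the two rows group, and the variable names @ordered and $report are chosen for this example.

```perl
use strict;
use warnings;

# Stand-in for %groups after single-row groups have been removed.
# Each value is a list of rows; each row is an array ref of its fields.
my %groups = (
    'ALQLTQSPSSLSAS RITLKESGPPLVKPTCS' => [
        [ qw(3 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C ELDKWAN) ],
        [ qw(4 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C ELDKWAG) ],
    ],
    'MPIMGSSVAVLAIL DIVMTQSPTVTI' => [
        [ qw(9 A MPIMGSSVAVLAIL B DIVMTQSPTVTI C EVQLQQSGRGP) ],
        [ qw(10 A MPIMGSSVAVLAIL B DIVMTQSPTVTI C EVQLQQSGRGP) ],
    ],
);

# Order the groups by the line number of their first row.
my @ordered = sort { $groups{$a}[0][0] <=> $groups{$b}[0][0] } keys %groups;

my $report = '';
for my $key (@ordered) {
    my @group = @{ $groups{$key} };
    $report .= "$group[0][0]:\n";       # label: first line number in the group
    $report .= "\t@$_\n" for @group;    # "@$_" joins the fields with spaces
}
print $report;
```

Sorting on the first row's line number makes the output order deterministic, which iterating over keys %groups directly would not guarantee.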