我希望Perl(5.8.8)找出哪个单词与数组中的其他单词具有最多共同字母 - 但只能找到位于同一位置的字母。 (最好不使用libs。)
以这个单词列表为例:
她的BALER是与其他人共有最多字母的词。它匹配BAKER中的BAxER,SALER中的xALER,CARER中的xAxER和RUFFR中的xxxxR。
我希望Perl能够在具有相同长度和大小写的任意单词列表中为我找到这个单词。似乎我在这里碰壁了,所以非常感谢帮助!
目前没有太多的脚本:
use strict;
use warnings;
my @wordlist = qw(BAKER SALER MALER BARER RUFFR);
foreach my $word (@wordlist) {
my @letters = split(//, $word);
# now trip trough each iteration and work magic...
}
评论在哪里,我尝试了几种代码,包括for-loops和++ varables。到目前为止,我的尝试都没有完成我需要做的事情。
所以,为了更好地解释:我需要的是逐字逐句地测试,对于每个字母位置,找到与列表中其他字母具有最多共同字母的单词,在该字母的位置位置。
一种可能的方法是首先检查哪个单词在字母位置0处最常见,然后测试字母位置1,依此类推,直到找到总和中字数最多的单词与列表中的其他单词一样。然后我想打印列表,就像一个矩阵,每个字母位置的分数加上每个单词的总分,与DavidO建议的不同。
您实际上最终得到的是每个单词的矩阵,每个字母位置的分数,以及矩阵中每个单词前后的总分。
这是我用于研究的Fallout 3终端黑客教程之一:FALLOUT 3: Hacking FAQ v1.2,我已经制定了一个缩短单词列表的程序,如下所示:
#!/usr/bin/perl
# See if one word has equal letters as the other, and how many of them are equal
use strict;
use warnings;
my $checkword = "APPRECIATION"; # the word to be checked
my $match = 4; # equal to the match you got from testing your checkword
my @checkletters = split(//, $checkword); #/
my @wordlist = qw(
PARTNERSHIPS
REPRIMANDING
CIVILIZATION
APPRECIATION
CONVERSATION
CIRCUMSTANCE
PURIFICATION
SECLUSIONIST
CONSTRUCTION
DISAPPEARING
TRANSMISSION
APPREHENSIVE
ENCOUNTERING
);
print "$checkword has $match letters in common with:\n";
foreach my $word (@wordlist) {
next if $word eq $checkword;
my @letters = split(//, $word);
my $length = @letters; # determine length of array (how many letters to check)
my $eq_letters = 0; # reset to 0 for every new word to be tested
for (my $i = 0; $i < $length; $i++) {
if ($letters[$i] eq $checkletters[$i]) {
$eq_letters++;
}
}
if ($eq_letters == $match) {
print "$word\n";
}
}
# Now to make a script on to find the best word to check in the first place...
此脚本将生成CONSTRUCTION
和TRANSMISSION
作为结果,就像游戏常见问题解答一样。原始问题的诀窍(以及我自己无法找到的东西)是如何找到最好的单词来尝试,即APPRECIATION
。
好的,我现在已根据您的帮助提供了我自己的解决方案,并认为此线程已关闭。很多,非常感谢所有的贡献者。你帮助很大,而且我也学到了很多东西。:D
答案 0 :(得分:7)
这是一种方式。重新阅读你的规范几次,我认为这就是你要找的东西。
值得一提的是,有可能会有多个单词得分相同的单词。从你的列表中只有一个赢家,但有可能在更长的列表中,会有几个同样获胜的单词。该解决方案涉及到这一点。此外,据我所知,只有当字母匹配出现在每个单词的同一列时,才会计算字母匹配。如果是这种情况,这是一个有效的解决方案:
use 5.012;
use strict;
use warnings;
use List::Util 'max';
my @words = qw/
BAKER
SALER
BALER
CARER
RUFFR
/;
my @scores;
foreach my $word ( @words ) {
my $score;
foreach my $comp_word ( @words ) {
next if $comp_word eq $word;
foreach my $pos ( 0 .. ( length $word ) - 1 ) {
$score++ if substr( $word, $pos, 1 ) eq substr( $comp_word, $pos, 1);
}
}
push @scores, $score;
}
my $max = max( @scores );
my ( @max_ixs ) = grep { $scores[$_] == $max } 0 .. $#scores;
say "Words with most matches:";
say for @words[@max_ixs];
此解决方案计算每个字母列的每个字母与其他字词匹配的次数。例如:
Words: Scores: Because:
ABC 1, 2, 1 = 4 A matched once, B matched twice, C matched once.
ABD 1, 2, 1 = 4 A matched once, B matched twice, D matched once.
CBD 0, 2, 1 = 3 C never matched, B matched twice, D matched once.
BAC 0, 0, 1 = 1 B never matched, A never matched, C matched once.
这将为您提供ABC和ABD的获胜者,每个人都有四个位置匹配的分数。即,第一列,第一行与第一行第二行,第三行和第四行匹配的累积时间,以及后续列的依此类推。 它可能能够进一步优化,并且重新措辞更短,但我试图保持逻辑相当容易阅读。享受!
更新/编辑 我想到了它,并意识到虽然我现有的方法完全按照原始问题的要求进行,但它在O(n ^ 2)时间内完成,这相对较慢。但是如果我们对每个列的字母使用哈希键(每个键一个字母),并计算每个字母在列中出现的次数(作为哈希元素的值),我们可以在O(1)中进行求和。 )时间,以及我们在O(n * c)时间内遍历列表(其中c是列数,n是单词数)。还有一些设置时间(创建哈希)。但我们仍然有很大的进步。这是每种技术的新版本,以及每种技术的基准比较。
use strict;
use warnings;
use List::Util qw/ max sum /;
use Benchmark qw/ cmpthese /;
my @words = qw/
PARTNERSHIPS
REPRIMANDING
CIVILIZATION
APPRECIATION
CONVERSATION
CIRCUMSTANCE
PURIFICATION
SECLUSIONIST
CONSTRUCTION
DISAPPEARING
TRANSMISSION
APPREHENSIVE
ENCOUNTERING
/;
# Just a test run for each solution.
my( $top, $indexes_ref );
($top, $indexes_ref ) = find_top_matches_force( \@words );
print "Testing force method: $top matches.\n";
print "@words[@$indexes_ref]\n";
( $top, $indexes_ref ) = find_top_matches_hash( \@words );
print "Testing hash method: $top matches.\n";
print "@words[@$indexes_ref]\n";
my $count = 20000;
cmpthese( $count, {
'Hash' => sub{ find_top_matches_hash( \@words ); },
'Force' => sub{ find_top_matches_force( \@words ); },
} );
sub find_top_matches_hash {
my $words = shift;
my @scores;
my $columns;
my $max_col = max( map { length $_ } @{$words} ) - 1;
foreach my $col_idx ( 0 .. $max_col ) {
$columns->[$col_idx]{ substr $_, $col_idx, 1 }++
for @{$words};
}
foreach my $word ( @{$words} ) {
my $score = sum(
map{
$columns->[$_]{ substr $word, $_, 1 } - 1
} 0 .. $max_col
);
push @scores, $score;
}
my $max = max( @scores );
my ( @max_ixs ) = grep { $scores[$_] == $max } 0 .. $#scores;
return( $max, \@max_ixs );
}
sub find_top_matches_force {
my $words = shift;
my @scores;
foreach my $word ( @{$words} ) {
my $score;
foreach my $comp_word ( @{$words} ) {
next if $comp_word eq $word;
foreach my $pos ( 0 .. ( length $word ) - 1 ) {
$score++ if
substr( $word, $pos, 1 ) eq substr( $comp_word, $pos, 1);
}
}
push @scores, $score;
}
my $max = max( @scores );
my ( @max_ixs ) = grep { $scores[$_] == $max } 0 .. $#scores;
return( $max, \@max_ixs );
}
输出结果为:
Testing force method: 39 matches.
APPRECIATION
Testing hash method: 39 matches.
APPRECIATION
Rate Force Hash
Force 2358/s -- -74%
Hash 9132/s 287% --
我看到你看到其他一些选项之后你的原始规格发生了变化,这在某种程度上是创新的本质,但这个难题仍然存在于我的脑海中。如您所见,我的哈希方法比原始方法快287%。在更短的时间内获得更多乐趣!
答案 1 :(得分:5)
作为一个起点,您可以有效地检查他们共有多少个字母:
$count = ($word1 ^ $word2) =~ y/\0//;
但这只有在你遍历所有可能的单词对时才有用,在这种情况下这是不必要的:
use strict;
use warnings;
my @words = qw/
BAKER
SALER
BALER
CARER
RUFFR
/;
# you want a hash to indicate which letters are present how many times in each position:
my %count;
for my $word (@words) {
my @letters = split //, $word;
$count{$_}{ $letters[$_] }++ for 0..$#letters;
}
# then for any given word, you get the count for each of its letters minus one (because the word itself is included in the count), and see if it is a maximum (so far) for any position or for the total:
my %max_common_letters_count;
my %max_common_letters_words;
for my $word (@words) {
my @letters = split //, $word;
my $total;
for my $position (0..$#letters, 'total') {
my $count;
if ( $position eq 'total' ) {
$count = $total;
}
else {
$count = $count{$position}{ $letters[$position] } - 1;
$total += $count;
}
if ( ! $max_common_letters_count{$position} || $count >= $max_common_letters_count{$position} ) {
if ( $max_common_letters_count{$position} && $count == $max_common_letters_count{$position} ) {
push @{ $max_common_letters_words{$position} }, $word;
}
else {
$max_common_letters_count{$position} = $count;
$max_common_letters_words{$position} = [ $word ];
}
}
}
}
# then show the maximum words for each position and in total:
for my $position ( sort { $a <=> $b } grep $_ ne 'total', keys %max_common_letters_count ) {
printf( "Position %s had a maximum of common letters of %s in words: %s\n",
$position,
$max_common_letters_count{$position},
join(', ', @{ $max_common_letters_words{$position} })
);
}
printf( "The maximum total common letters was %s in words(s): %s\n",
$max_common_letters_count{'total'},
join(', ', @{ $max_common_letters_words{'total'} })
);
答案 2 :(得分:4)
这是一个完整的脚本。它使用与ysth提到的相同的想法(虽然我独立地拥有它)。使用按位xor组合字符串,然后计算结果中的NUL数。只要您的字符串是ASCII,就会告诉您有多少匹配的字母。 (这种比较区分大小写,我不确定如果字符串是UTF-8会发生什么。可能没什么好处。)
use strict;
use warnings;
use 5.010;
use List::Util qw(max);
sub findMatches
{
my ($words) = @_;
# Compare each word to every other word:
my @matches = (0) x @$words;
for my $i (0 .. $#$words-1) {
for my $j ($i+1 .. $#$words) {
my $m = ($words->[$i] ^ $words->[$j]) =~ tr/\0//;
$matches[$i] += $m;
$matches[$j] += $m;
}
}
# Find how many matches in the best word:
my $max = max(@matches);
# Find the words with that many matches:
my @wanted = grep { $matches[$_] == $max } 0 .. $#matches;
wantarray ? @$words[@wanted] : $words->[$wanted[0]];
} # end findMatches
my @words = qw(
BAKER
SALER
BALER
CARER
RUFFR
);
say for findMatches(\@words);
答案 3 :(得分:2)
有一段时间没有触及perl,所以伪代码就是这样。这不是最快的算法,但它适用于少量单词。
totals = new map #e.g. an object to map :key => :value
for each word a
for each word b
next if a equals b
totals[a] = 0
for i from 1 to a.length
if a[i] == b[i]
totals[a] += 1
end
end
end
end
return totals.sort_by_key.last
很抱歉缺少perl,但是如果你将它编码为perl,它应该像魅力一样。
关于运行时的快速说明:这将在时间 number_of_words ^ 2 * length_of_words 中运行,因此在包含100个单词的列表中,每个单词长度为10个字符,这将在100,000个周期内运行,适合大多数应用。
答案 4 :(得分:1)
这是我尝试回答的问题。如果需要,您还可以查看每个匹配项。 (即BALER在BAKER中匹配4个字符)。 编辑:如果单词之间存在联系,它现在会捕获所有匹配项(我将“CAKER”添加到要测试的列表中)。
#! usr/bin/perl
use strict;
use warnings;
my @wordlist = qw( BAKER SALER BALER CARER RUFFR CAKER);
my %wordcomparison;
#foreach word, break it into letters, then compare it against all other words
#break all other words into letters and loop through the letters (both words have same amount), adding to the count of matched characters each time there's a match
foreach my $word (@wordlist) {
my @letters = split(//, $word);
foreach my $otherword (@wordlist) {
my $count;
next if $otherword eq $word;
my @otherwordletters = split (//, $otherword);
foreach my $i (0..$#letters) {
$count++ if ( $letters[$i] eq $otherwordletters[$i] );
}
$wordcomparison{"$word"}{"$otherword"} = $count;
}
}
# sort (unnecessary) and loop through the keys of the hash (words in your list)
# foreach key, loop through the other words it compares with
#Add a new key: total, and sum up all the matched characters.
foreach my $word (sort keys %wordcomparison) {
foreach ( sort keys %{ $wordcomparison{$word} }) {
$wordcomparison{$word}{total} += $wordcomparison{$word}{$_};
}
}
#Want $word with highest total
my @max_match = (sort { $wordcomparison{$b}{total} <=> $wordcomparison{$a}{total} } keys %wordcomparison );
#This is to get all if there is a tie:
my $maximum = $max_match[0];
foreach (@max_match) {
print "$_\n" if ($wordcomparison{$_}{total} >= $wordcomparison{$maximum}{total} )
}
输出简单:CAKER BALER和BAKER。
哈希%wordcomparison
看起来像:
'SALER'
{
'RUFFR' => 1,
'BALER' => 4,
'BAKER' => 3,
'total' => 11,
'CARER' => 3
};
答案 5 :(得分:1)
这是一个依赖于转置单词以便计算相同字符的版本。我使用了原始比较中的单词,而不是代码。
这适用于任何长度的单词和任何长度列表。输出是:
Word score
---- -----
BALER 12
SALER 11
BAKER 11
CARER 10
RUFFR 4
代码:
use warnings;
use strict;
my @w = qw(BAKER SALER BALER CARER RUFFR);
my @tword = t_word(@w);
my @score;
push @score, str_count($_) for @tword;
@score = t_score(@score);
my %total;
for (0 .. $#w) {
$total{$w[$_]} = $score[$_];
}
print "Word\tscore\n";
print "----\t-----\n";
print "$_\t$total{$_}\n" for (sort { $total{$b} <=> $total{$a} } keys %total);
# transpose the words
sub t_word {
my @w = @_;
my @tword;
for my $word (@w) {
my $i = 0;
while ($word =~ s/(.)//) {
$tword[$i++] .= $1;
}
}
return @tword;
}
# turn each character into a count
sub str_count {
my $str = uc(shift);
while ( $str =~ /([A-Z])/ ) {
my $chr = $1;
my $num = () = $str =~ /$chr/g;
$num--;
$str =~ s/$chr/$num /g;
}
return $str;
}
# sum up the character counts
# while reversing the transpose
sub t_score {
my @count = @_;
my @score;
for my $num (@count) {
my $i = 0;
while( $num =~ s/(\d+) //) {
$score[$i++] += $1;
}
}
return @score;
}
答案 6 :(得分:0)
你可以这样做,使用一个脏的正则表达技巧来执行代码,如果一个字母在其位置匹配,但不是,否则,谢天谢地,你可以很容易地构建正则表达式:
正则表达式的示例是:
(?:(C(?{ $c++ }))|.)(?:(A(?{ $c++ }))|.)(?:(R(?{ $c++ }))|.)(?:(E(?{ $c++ }))|.)(?:(R(?{ $c++ }))|.)
这可能会也可能不会很快。
use 5.12.0;
use warnings;
use re 'eval';
my @words = qw(BAKER SALER BALER CARER RUFFR);
my ($best, $count) = ('', 0);
foreach my $word (@words) {
our $c = 0;
foreach my $candidate (@words) {
next if $word eq $candidate;
my $regex_str = join('', map {"(?:($_(?{ \$c++ }))|.)"} split '', $word);
my $regex = qr/^$regex_str$/;
$candidate =~ $regex or die "did not match!";
}
say "$word $c";
if ($c > $count) {
$best = $word;
$count = $c;
}
}
say "Matching: first best: $best";
使用xor技巧会很快,但会假设您可能遇到的字符范围很多。有很多方法可以让utf-8打破这种情况。
答案 7 :(得分:0)
非常感谢所有捐助者!你当然告诉我,我还有很多需要学习的东西,但你也帮助我制定了自己的答案。我只是把它放在这里作为参考和可能的反馈,因为可能有更好的方法。对我而言,这是我自己能找到的最简单,最直接的方法。请享用! :)
#!/usr/bin/perl
use strict;
use warnings;
# a list of words for testing
my @list = qw(
BAKER
SALER
BALER
CARER
RUFFR
);
# populate two dimensional array with the list,
# so we can compare each letter with the other letters on the same row more easily
my $list_length = @list;
my @words;
for (my $i = 0; $i < $list_length; $i++) {
my @letters = split(//, $list[$i]);
my $letters_length = @letters;
for (my $j = 0; $j < $letters_length; $j++) {
$words[$i][$j] = $letters[$j];
}
}
# this gives a two-dimensionla array:
#
# @words = ( ["B", "A", "K", "E", "R"],
# ["S", "A", "L", "E", "R"],
# ["B", "A", "L", "E", "R"],
# ["C", "A", "R", "E", "R"],
# ["R", "U", "F", "F", "R"],
# );
# now, on to find the word with most letters in common with the other on the same row
# add up the score for each letter in each word
my $word_length = @words;
my @letter_score;
for my $i (0 .. $#words) {
for my $j (0 .. $#{$words[$i]}) {
for (my $k = 0; $k < $word_length; $k++) {
if ($words[$i][$j] eq $words[$k][$j]) {
$letter_score[$i][$j] += 1;
}
}
# we only want to add in matches outside the one we're testing, therefore
$letter_score[$i][$j] -= 1;
}
}
# sum each score up
my @scores;
for my $i (0 .. $#letter_score ) {
for my $j (0 .. $#{$letter_score[$i]}) {
$scores[$i] += $letter_score[$i][$j];
}
}
# find the highest score
my $max = $scores[0];
foreach my $i (@scores[1 .. $#scores]) {
if ($i > $max) {
$max = $i;
}
}
# and print it all out :D
for my $i (0 .. $#letter_score ) {
print "$list[$i]: $scores[$i]";
if ($scores[$i] == $max) {
print " <- best";
}
print "\n";
}
运行时,脚本会生成以下内容:
BAKER: 11
SALER: 11
BALER: 12 <- best
CARER: 10
RUFFR: 4