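I have a lot of strings in an array, thousands of them. I need to compare every string in that array with every other string and find the most unique ones.
You can see and test my code below, but as you will notice, it takes a long time (approx. 160 s on localhost, Intel Core i7) to compare 100 items, and I need to compare thousands. Any ideas on how to optimize this code?
I don't need to optimize the first part of the code (data generation), because I pull the data from elsewhere. I only need to optimize the second part (the comparison). As someone noted, the script can be optimized by not repeating comparisons (a -> b, b -> a). I know that, but I am still trying to save more than half of the time. Maybe there is a better function for comparing strings than similar_text, but I have no experience with anything else, which is why I am asking here.
Code: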
<?php
//set how many strings to generate for the test
$number_of_test_strings = 100;
$strings = array();
$chars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
$size_chars_array = strlen( $chars );
/*
 * Creating some random strings - data for test
 */
//just for testing performance
$creating_test_data_time_start = microtime();
//create some random strings in the array (start at 0 so that exactly $number_of_test_strings are generated)
for ( $i = 0; $i < $number_of_test_strings; $i++ ) {
    //set random string to empty string
    $random_string = '';
    //pick the random length once (between 1800 and 2500 chars) instead of calling rand() on every loop check
    $length = rand( 1800, 2500 );
    //choose characters at random from the characters string
    for ( $j = 0; $j < $length; $j++ ) {
        $random_string .= $chars[ rand( 0, $size_chars_array - 1 ) ];
    }
    //insert random string into the strings array
    $strings[] = $random_string;
}
//just for testing performance
$creating_test_data_time_end = microtime();
/*
 * Comparison itself
 */
//just for testing performance
$uniqueness_time_start = microtime();
//initialize the result array before filling it
$uniqueness_array = array();
//foreach for all strings in array
foreach ($strings as $key_first_element => $first_element) {
    //reset of matched value
    $matched = 0;
    //compare the first element with every other string
    foreach ($strings as $key_second_element => $second_element) {
        //don't compare a string with itself
        if ($key_first_element !== $key_second_element) {
            //compare those two strings; $match receives the similarity in percent
            similar_text($first_element, $second_element, $match);
            //add match value to matched
            $matched = ($matched + $match);
        }
    }
    //average similarity of this string to all the others (lower means more unique)
    $uniqueness = ($matched / (count($strings) - 1));
    //store it in array
    $uniqueness_array[$key_first_element] = $uniqueness;
}
//sort the array by uniqueness (the lower the match, the better) - the best at the beginning
asort($uniqueness_array);
//just for testing performance
$uniqueness_time_end = microtime();
//just output performance info
echo 'Creating of test data: '. (array_sum( explode( ' ' , $creating_test_data_time_end ) ) - array_sum( explode( ' ' , $creating_test_data_time_start ) )) .' s, comparing strings: '. (array_sum( explode( ' ' , $uniqueness_time_end ) ) - array_sum( explode( ' ' , $uniqueness_time_start ) )) .' s<br />';
$i = 0;
foreach ($uniqueness_array as $key_string => $uniqueness_of_string) {
    //output just the 10 best results
    if ($i >= 10) {
        break;
    }
    echo 'Uniqueness of a string with key '.$key_string.' is '.$uniqueness_of_string.'<br />';
    $i++;
}
?>
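The idea of not repeating comparisons (a -> b, b -> a) could be sketched roughly as follows. The function name averageSimilarity is hypothetical and not part of the original script. The sketch also assumes that treating similar_text as symmetric is acceptable; the PHP manual notes that swapping the two arguments may yield a different result, so the averages can differ slightly from those of the full double loop.

```php
<?php
// Hypothetical helper (not from the original script): average similarity of
// each string to all the others, calling similar_text once per unordered pair.
// Assumption: similar_text($a, $b) is treated as equal to similar_text($b, $a).
function averageSimilarity(array $strings): array
{
    // Reindex so that positions 0..n-1 can be used in the loops below.
    $strings = array_values($strings);
    $count = count($strings);
    if ($count < 2) {
        // Nothing to compare against.
        return array_fill(0, $count, 0.0);
    }

    $totals = array_fill(0, $count, 0.0);
    for ($i = 0; $i < $count - 1; $i++) {
        // Start at $i + 1: pairs with a smaller second index were already handled.
        for ($j = $i + 1; $j < $count; $j++) {
            similar_text($strings[$i], $strings[$j], $percent);
            // Credit the same percentage to both members of the pair.
            $totals[$i] += $percent;
            $totals[$j] += $percent;
        }
    }

    $averages = array();
    foreach ($totals as $key => $total) {
        $averages[$key] = $total / ($count - 1);
    }
    // Lowest average similarity (most unique) first, keys preserved.
    asort($averages);
    return $averages;
}
```

Calling averageSimilarity($strings) returns the averages keyed by position in the input, sorted so the most unique string comes first. This only removes the duplicated half of the calls; the number of comparisons still grows quadratically with the number of strings.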
Expected input and output:
//Expected input array
$input = array(
    'Today is a great day for skiing and I dont have enough time',
    'Wednesday is a very good day for skiing and snowboarding and I dont have enough time',
    'Today is a superior day for skiing and I dont have enough time',
    'Completly different string about nothing'
);
//Expected output array - the order is important - the most different strings at the beginning of the array
$output = array(
    'Completly different string about nothing',
    'Wednesday is a very good day for skiing and snowboarding and I dont have enough time',
    'Today is a superior day for skiing and I dont have enough time',
    'Today is a great day for skiing and I dont have enough time'
);
Answer 0 (score: 1)
I really don't think similar_text is enough on its own. You can combine it with levenshtein to get the result you want.
$words = array(
    'Today is a great day for skiing and I dont have enough time',
    'Wednesday is a very good day for skiing and snowboarding and I dont have enough time',
    'Today is a superior day for skiing and I dont have enough time',
    'Completly different string about nothing'
);
// wrap every string in a Word object that compares it against all the others
$unique = array_map(function ($v) use ($words) {
    return new Word($words, $v);
}, $words);
Using similar_text
echo "Uniqueness By similar_text\n\n";
// ascending: the lowest similarity (most unique) comes first
usort($unique, function ($a, $b) {
    $a = $a->getSimilar();
    $b = $b->getSimilar();
    return ($a == $b) ? 0 : (($a < $b) ? -1 : 1);
});
foreach ($unique as $var) {
    printf("%s (%s) \n", $var->getWord(), $var->getSimilar());
}
similar_text output
Uniqueness By similar_text
Completly different string about nothing (36.363636363636)
Wednesday is a very good day for skiing and snowboarding and I dont have enough time (75.342465753425)
Today is a great day for skiing and I dont have enough time (90.909090909091)
Today is a superior day for skiing and I dont have enough time (90.909090909091)
You can see that Today is a great and Today is a superior tie at 90.909090909091, so they do not end up in the expected order.
Using levenshtein
echo "\n\nUniqueness By levenshtein\n\n";
// descending: the largest distance (most unique) comes first
usort($unique, function ($a, $b) {
    $a = $a->getLev();
    $b = $b->getLev();
    return ($a == $b) ? 0 : (($a < $b) ? 1 : -1);
});
foreach ($unique as $var) {
    printf("%s (%s) \n", $var->getWord(), $var->getLev());
}
levenshtein output
Uniqueness By levenshtein
Completly different string about nothing (63)
Wednesday is a very good day for skiing and snowboarding and I dont have enough time (63)
Today is a superior day for skiing and I dont have enough time (45)
Today is a great day for skiing and I dont have enough time (43)
As you can see, Today is a superior and Today is a great are very close in levenshtein distance (45 and 43). If they ended up with the same distance, the resulting order might not be reliable.
Combining both to get a simple index
echo "\n\nUniqueness By Simple Index \n\n";
// ascending: the lowest index (most unique) comes first
usort($unique, function ($a, $b) {
    $a = $a->getIndex();
    $b = $b->getIndex();
    return ($a == $b) ? 0 : (($a < $b) ? -1 : 1);
});
foreach ($unique as $var) {
    printf("%s (%s) \n", $var->getWord(), $var->getIndex());
}
Simple index output
Uniqueness By Simple Index
Completly different string about nothing (0.57720057720058)
Wednesday is a very good day for skiing and snowboarding and I dont have enough time (1.1959121548163)
Today is a superior day for skiing and I dont have enough time (2.020202020202)
Today is a great day for skiing and I dont have enough time (2.1141649048626)
Combining the two resolves possible ties better.
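The class used
class Word {
    // largest levenshtein distance to any other string
    private $lev = 0;
    // highest similar_text percentage against any other string
    private $similar = 0;
    // $similar divided by $lev; lower means more unique
    private $index = 0;
    private $word;
    function __construct($words, $word) {
        $this->word = $word;
        foreach ($words as $selected) {
            // skip the string itself (and any exact duplicates)
            if ($selected == $word) {
                continue;
            }
            $lev = levenshtein($word, $selected);
            if ($lev > $this->lev) {
                $this->lev = $lev;
            }
            similar_text($word, $selected, $match);
            if ($match > $this->similar) {
                $this->similar = $match;
            }
        }
        // avoid a division by zero when there is no other distinct string
        $this->index = ($this->lev > 0) ? $this->similar / $this->lev : 0;
    }
    function getLev() {
        return $this->lev;
    }
    function getSimilar() {
        return $this->similar;
    }
    function getIndex() {
        return $this->index;
    }
    function getWord() {
        return $this->word;
    }
}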
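For quick experiments, the same index can be computed without the class. The following is only a sketch of that idea: the function name rankBySimpleIndex and the returned array shape are hypothetical, and returning 0.0 when the largest distance is zero is an assumption of this sketch. Note also that before PHP 8.0, levenshtein() returned -1 when either string was longer than 255 characters, so on older versions this approach is only meaningful for short strings.

```php
<?php
// Hypothetical helper (not from the answer): ranks strings by the simple index,
// i.e. highest similar_text percentage divided by largest levenshtein distance.
function rankBySimpleIndex(array $words): array
{
    $ranked = array();
    foreach ($words as $word) {
        $maxLev = 0;
        $maxSimilar = 0.0;
        foreach ($words as $other) {
            // Skip the string itself (and exact duplicates), as the class does.
            if ($other === $word) {
                continue;
            }
            $maxLev = max($maxLev, levenshtein($word, $other));
            similar_text($word, $other, $percent);
            $maxSimilar = max($maxSimilar, $percent);
        }
        $ranked[] = array(
            'word'  => $word,
            // Assumption: an index of 0.0 when there is nothing to divide by.
            'index' => ($maxLev > 0) ? $maxSimilar / $maxLev : 0.0,
        );
    }
    // Lowest index (most unique) first.
    usort($ranked, function ($a, $b) {
        return $a['index'] <=> $b['index'];
    });
    return $ranked;
}
```

Each element of the result holds the original string under 'word' and its index under 'index', so rankBySimpleIndex($words)[0]['word'] is the string ranked most unique.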