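I have two PHP functions that calculate the relation between two texts. Both use the bag-of-words model, but check2() is much faster. Even so, both functions give the same results. Why? check1() uses one big dictionary array containing ALL the words, as described in the bag-of-words model. check2() does not use one big array; it uses an array that contains only the words of one text. So check2() should not work, but it does. Why do both functions give the same results?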
function check1($terms_in_article1, $terms_in_article2) {
    global $zeit_check1;
    $zeit_s = microtime(TRUE);
    $length1 = count($terms_in_article1); // number of words
    $length2 = count($terms_in_article2); // number of words
    $all_terms = array_merge($terms_in_article1, $terms_in_article2);
    $all_terms = array_unique($all_terms);
    foreach ($all_terms as $all_termsa) {
        $term_vector1[$all_termsa] = 0;
        $term_vector2[$all_termsa] = 0;
    }
    foreach ($terms_in_article1 as $terms_in_article1a) {
        $term_vector1[$terms_in_article1a]++;
    }
    foreach ($terms_in_article2 as $terms_in_article2a) {
        $term_vector2[$terms_in_article2a]++;
    }
    $score = 0;
    foreach ($all_terms as $all_termsa) {
        $score += $term_vector1[$all_termsa]*$term_vector2[$all_termsa];
    }
    $score = $score/($length1*$length2);
    $score *= 500; // for better readability
    $zeit_e = microtime(TRUE);
    $zeit_check1 += ($zeit_e-$zeit_s);
    return $score;
}
function check2($terms_in_article1, $terms_in_article2) {
    global $zeit_check2;
    $zeit_s = microtime(TRUE);
    $length1 = count($terms_in_article1); // number of words
    $length2 = count($terms_in_article2); // number of words
    $score_table = array();
    foreach($terms_in_article1 as $term){
        if(!isset($score_table[$term])) $score_table[$term] = 0;
        $score_table[$term] += 1;
    }
    $score_table2 = array();
    foreach($terms_in_article2 as $term){
        if(isset($score_table[$term])){
            if(!isset($score_table2[$term])) $score_table2[$term] = 0;
            $score_table2[$term] += 1;
        }
    }
    $score = 0;
    foreach($score_table2 as $key => $entry){
        $score += $score_table[$key] * $entry;
    }
    $score = $score/($length1*$length2);
    $score *= 500;
    $zeit_e = microtime(TRUE);
    $zeit_check2 += ($zeit_e-$zeit_s);
    return $score;
}
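I hope you can help me. Thanks in advance!
Answer 0 (score: 6)
Since you seem to be concerned about performance, here is an optimized version of the algorithm in your check2 function. It relies on a few more built-in functions to improve speed.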
function check ($terms1, $terms2)
{
    $counts1 = array_count_values($terms1);
    $totalScore = 0;
    foreach ($terms2 as $term) {
        if (isset($counts1[$term])) $totalScore += $counts1[$term];
    }
    return $totalScore * 500 / (count($terms1) * count($terms2));
}
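To see why adding $counts1[$term] once per occurrence in the second text matches the frequency product used in check2, here is a small standalone sketch. The function name check_counts, the empty-input guard, and the sample word lists are assumptions made for illustration; they are not part of the answer above. It assumes PHP 7 or later.

```php
<?php
// Hypothetical name: mirrors the check() function above so this snippet runs on its own.
function check_counts(array $terms1, array $terms2): float
{
    // Assumption: return 0 for empty input instead of dividing by zero.
    if (count($terms1) === 0 || count($terms2) === 0) {
        return 0.0;
    }

    // Frequency of every term in the first text.
    $counts1 = array_count_values($terms1);

    // Each occurrence of a shared term in the second text adds its
    // frequency in the first text, which sums to freq1 * freq2 per term.
    $total = 0;
    foreach ($terms2 as $term) {
        if (isset($counts1[$term])) {
            $total += $counts1[$term];
        }
    }

    return $total * 500 / (count($terms1) * count($terms2));
}

// Made-up sample data.
$a = ['the', 'cat', 'sat', 'the'];
$b = ['the', 'dog', 'the', 'cat'];
echo check_counts($a, $b), "\n"; // prints 156.25
```

With these inputs the shared terms are 'the' (2 × 2) and 'cat' (1 × 1), so the raw sum is 5 and the score is 5 × 500 / 16 = 156.25.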
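Answer 1 (score: 3)
The two functions implement almost the same algorithm, but the first one does it in a straightforward way, while the second one is a bit smarter and skips part of the unnecessary work.
check1 works like this: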
// loop length(words1) times
for each word in words1:
    freq1[word]++
// loop length(words2) times
for each word in words2:
    freq2[word]++
// loop length(union(words1, words2)) times
for each word in union(words1, words2):
    score += freq1[word] * freq2[word]
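But remember: when you multiply something by zero, you get zero.
This means that counting the frequencies of words that do not appear in both sets is a waste of time. For such a word one of the two frequencies is zero, so the product is zero and adds nothing to the score.
check2 takes this into account: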
// loop length(words1) times
for each word in words1:
    freq1[word]++
// loop length(words2) times
for each word in words2:
    if freq1[word] > 0:
        freq2[word]++
// loop length(intersection(words1, words2)) times
for each word in freq2:
    score += freq1[word] * freq2[word]
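The two pseudocode listings can be compared directly with a short PHP sketch. The function names score_over_union and score_over_intersection and the sample data are hypothetical, chosen only for this illustration, and the normalisation by the text lengths is left out because it is identical in both versions. It assumes PHP 7 or later.

```php
<?php
// Sum of frequency products over every word in either text (the check1 approach).
function score_over_union(array $words1, array $words2): int
{
    $freq1 = array_count_values($words1);
    $freq2 = array_count_values($words2);
    $score = 0;
    // The array union operator keeps the keys of both frequency tables.
    foreach (array_keys($freq1 + $freq2) as $word) {
        // A word missing from one text has frequency 0 there,
        // so its product is 0 and the score does not change.
        $score += ($freq1[$word] ?? 0) * ($freq2[$word] ?? 0);
    }
    return $score;
}

// Sum of frequency products over shared words only (the check2 approach).
function score_over_intersection(array $words1, array $words2): int
{
    $freq1 = array_count_values($words1);
    $freq2 = array_count_values($words2);
    $score = 0;
    // array_intersect_key keeps the entries of $freq1 whose keys also exist in $freq2.
    foreach (array_intersect_key($freq1, $freq2) as $word => $count1) {
        $score += $count1 * $freq2[$word];
    }
    return $score;
}

// Made-up sample data: only 'a' occurs in both texts.
$w1 = ['a', 'b', 'a', 'c'];
$w2 = ['a', 'd', 'a', 'a'];
echo score_over_union($w1, $w2), ' ', score_over_intersection($w1, $w2), "\n"; // prints "6 6"
```

Both functions return 6 here: 'a' contributes 2 × 3, while 'b', 'c' and 'd' each contribute a product with zero. The second loop simply never visits them.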