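How can I implement this
tf-idf(WORD) = occurrences(WORD,DOCUMENT) / number-of-words(DOCUMENT) * log10(documents(ALL) / (1 + documents(WORD,ALL)))
in my PHP code to rank search results?
For reference, here is my current code: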
Answer 0 (score: 1)
I only understand part of what you are asking, but I think I can help with the occurrences(WORD,DOCUMENT) / number-of-words(DOCUMENT)
part:
<?php
function rank($word, $document)
{
    // Turn every run of whitespace (newlines, tabs) into one space so words stay separated
    $document = preg_replace('/[ \t\r\n]+/', ' ', $document);
    // Remove ASCII characters other than space, hyphen, 0-9, A-Z, and a-z.
    // Hyphen is a legitimate part of some words.
    // Extended ASCII (128 - 255) is purposely left untouched since it isn't often used
    $document = preg_replace('/[^ \-0-9A-Za-z\x80-\xFF]/', '', $document);
    // Split the document on spaces, dropping empty pieces. This gives us individual words
    $tmpDoc = preg_split('/ +/', $document, -1, PREG_SPLIT_NO_EMPTY);
    // Get the total number of words
    $numWords = count($tmpDoc);
    // An empty document has no term frequency; avoid dividing by zero
    if ($numWords === 0)
        return 0;
    // Count the elements equal to $word (strict, case-sensitive comparison)
    $occur = count(array_keys($tmpDoc, $word, true));
    return $occur / $numWords;
}
?>
I'm sure there are more efficient ways to do this, but there are certainly worse ways too.
Note: I haven't tested the PHP code.
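
For the remaining log10(documents(ALL) / (1 + documents(WORD,ALL))) part, here is a minimal sketch of how the whole formula from the question might be put together. It is not a definitive implementation. The names tokenize() and tfIdf() are hypothetical, and it assumes the collection is small enough to pass around as a plain array of strings. The tokenizer lowercases the text and splits on anything that is not a letter, digit, or hyphen, which is a simplifying assumption.

```php
<?php
// Hypothetical helper: lowercase the text, then split on anything
// that is not a letter, digit, or hyphen. Empty pieces are dropped.
function tokenize(string $text): array
{
    return preg_split('/[^a-z0-9\-]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

// Hypothetical helper: tf-idf(WORD) for one document, following the
// formula in the question. $allDocuments is an array of document strings.
function tfIdf(string $word, string $document, array $allDocuments): float
{
    $words = tokenize($document);
    // Nothing to score for an empty document or an empty collection
    if (count($words) === 0 || count($allDocuments) === 0) {
        return 0.0;
    }
    $word = strtolower($word);

    // occurrences(WORD, DOCUMENT) / number-of-words(DOCUMENT)
    $tf = count(array_keys($words, $word, true)) / count($words);

    // documents(WORD, ALL): number of documents containing the word at least once
    $containing = 0;
    foreach ($allDocuments as $doc) {
        if (in_array($word, tokenize($doc), true)) {
            $containing++;
        }
    }

    // log10(documents(ALL) / (1 + documents(WORD, ALL)))
    $idf = log10(count($allDocuments) / (1 + $containing));

    return $tf * $idf;
}

// Usage: score every document for one search word, then sort.
$docs = ['the cat sat', 'the dog barked loudly', 'a cat and a cat', 'birds sing'];
$scores = [];
foreach ($docs as $i => $doc) {
    $scores[$i] = tfIdf('cat', $doc, $docs);
}
arsort($scores); // highest score first, original array keys preserved
?>
```

After arsort(), the keys of $scores are the document indexes in ranked order. Note that with the "1 +" in the denominator, a word that appears in every document gets a negative score, because documents(ALL) / (1 + documents(WORD,ALL)) is then below 1. In a real search you would probably compute documents(WORD,ALL) once per word instead of re-tokenizing every document on each call.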