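I have successfully computed tf-idf from an array. Now I want to compute tf-idf from multiple text files, since my directory contains several of them. Can anyone modify this code so that it first reads all the files in the directory and then computes tf-idf from their contents? My code is below. Thanks.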
$collection = array(
    1 => 'this string is a short string but a good string',
    2 => 'this one isn\'t quite like the rest but is here',
    3 => 'this is a different short string that\'s not as short'
);
$dictionary = array();
$docCount = array();
foreach($collection as $docID => $doc) {
    $terms = explode(' ', $doc);
    $docCount[$docID] = count($terms);
    foreach($terms as $term) {
        if(!isset($dictionary[$term])) {
            $dictionary[$term] = array('df' => 0, 'postings' => array());
        }
        if(!isset($dictionary[$term]['postings'][$docID])) {
            $dictionary[$term]['df']++;
            $dictionary[$term]['postings'][$docID] = array('tf' => 0);
        }
        $dictionary[$term]['postings'][$docID]['tf']++;
    }
}
$temp = array('docCount' => $docCount, 'dictionary' => $dictionary);
Computing tf-idf:
$index = $temp;
$docCount = count($index['docCount']);
// Example term to score; any term present in the dictionary works.
$term = 'short';
$entry = $index['dictionary'][$term];
foreach($entry['postings'] as $docID => $postings) {
    echo "Document $docID and term $term give TFIDF: " .
        ($postings['tf'] * log($docCount / $entry['df'], 2));
    echo "\n";
}
Answer 0 (score: 2)
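Take a look at this answer: Reading all file contents from a directory - php
There you can find information on how to read the contents of all files in a directory.
With that information, you should be able to modify your code so that it works as expected.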
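
As a rough illustration of that suggestion, here is a minimal sketch that fills the collection from files instead of a hard-coded array. The function names readCollection(), buildIndex() and tfidf(), the `*.txt` pattern, and the directory path are assumptions made for this example, not part of the original code. It also splits on any run of whitespace with preg_split() rather than explode(' ', ...), since files usually contain newlines; that is a deliberate change from the question's tokenization.

```php
<?php
// Hypothetical helper: read every .txt file in $dir into an array
// keyed by file name, mirroring the question's $collection.
function readCollection(string $dir): array
{
    $collection = array();
    // glob() returns the matching paths, or false on error.
    $paths = glob(rtrim($dir, '/') . '/*.txt');
    if ($paths === false) {
        return $collection;
    }
    foreach ($paths as $path) {
        $text = file_get_contents($path);
        if ($text === false) {
            continue; // skip files that cannot be read
        }
        $collection[basename($path)] = $text; // file name is the document ID
    }
    return $collection;
}

// Same indexing logic as in the question, wrapped in a function.
function buildIndex(array $collection): array
{
    $dictionary = array();
    $docCount = array();
    foreach ($collection as $docID => $doc) {
        // Split on whitespace runs so newlines and tabs do not end up in terms.
        $terms = preg_split('/\s+/', $doc, -1, PREG_SPLIT_NO_EMPTY);
        $docCount[$docID] = count($terms);
        foreach ($terms as $term) {
            if (!isset($dictionary[$term])) {
                $dictionary[$term] = array('df' => 0, 'postings' => array());
            }
            if (!isset($dictionary[$term]['postings'][$docID])) {
                $dictionary[$term]['df']++;
                $dictionary[$term]['postings'][$docID] = array('tf' => 0);
            }
            $dictionary[$term]['postings'][$docID]['tf']++;
        }
    }
    return array('docCount' => $docCount, 'dictionary' => $dictionary);
}

// Returns docID => tf-idf for one term, or an empty array if the term is unknown.
function tfidf(array $index, string $term): array
{
    $scores = array();
    if (!isset($index['dictionary'][$term])) {
        return $scores;
    }
    $n = count($index['docCount']);
    $entry = $index['dictionary'][$term];
    foreach ($entry['postings'] as $docID => $posting) {
        $scores[$docID] = $posting['tf'] * log($n / $entry['df'], 2);
    }
    return $scores;
}

// Usage: '/path/to/texts' is a placeholder directory.
$index = buildIndex(readCollection('/path/to/texts'));
foreach (tfidf($index, 'short') as $docID => $score) {
    echo "Document $docID and term short give TFIDF: $score\n";
}
```

Keying the collection by file name keeps the rest of the question's logic unchanged, because the indexing loop only needs each document to have a unique ID.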