How to calculate tf-idf from multiple text files in PHP?

Time: 2015-01-30 18:57:34

Tags: php tf-idf

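I have successfully calculated tf-idf from an array. Now I would like tf-idf to be calculated from multiple text files, since my directory contains several of them. Can anyone modify this code so that it first reads all the files in the directory and then calculates tf-idf from their contents? My code is below. Thanks.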

$collection = array(
    1 => 'this string is a short string but a good string',
    2 => 'this one isn\'t quite like the rest but is here',
    3 => 'this is a different short string that\' not as short'
);

$dictionary = array();
$docCount = array();

foreach($collection as $docID => $doc) {
    $terms = explode(' ', $doc);
    $docCount[$docID] = count($terms);

    foreach($terms as $term) {
        if(!isset($dictionary[$term])) {
            $dictionary[$term] = array('df' => 0, 'postings' => array());
        }
        if(!isset($dictionary[$term]['postings'][$docID])) {
            $dictionary[$term]['df']++;
            $dictionary[$term]['postings'][$docID] = array('tf' => 0);
        }

        $dictionary[$term]['postings'][$docID]['tf']++;
    }
}

$temp = array('docCount' => $docCount, 'dictionary' => $dictionary);

Calculating tf-idf

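$index = $temp;
$term = 'short'; // example term to score; otherwise $term is only the last term left over from the loop above
$docCount = count($index['docCount']);
$entry = $index['dictionary'][$term];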
foreach($entry['postings'] as  $docID => $postings) {
    echo "Document $docID and term $term give TFIDF: " .
        ($postings['tf'] * log($docCount / $entry['df'], 2));
    echo "\n";
}
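
As a worked check of the formula evaluated in the loop above, here is the arithmetic for the sample collection, assuming the term being scored is `short` (an illustrative choice, not one stated in the question). The term occurs once in document 1 and twice in document 3, so its document frequency is 2 out of N = 3 documents:

```latex
\[
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log_2 \frac{N}{\mathrm{df}(t)}
\]
% N = 3 documents, df(short) = 2 (documents 1 and 3)
\[
\text{doc 1: } 1 \cdot \log_2 \tfrac{3}{2} \approx 0.585,
\qquad
\text{doc 3: } 2 \cdot \log_2 \tfrac{3}{2} \approx 1.170
\]
```

A term that appears in every document gets a score of 0 under this formula, since log2(N/N) = 0.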

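1 Answer:

Answer 0: (score: 2)

Take a look at this answer: Reading all file contents from a directory - php

There you will find information on how to read the contents of all files in a directory. With that, you should be able to modify your code so that it works as intended.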
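
To make that suggestion concrete, here is a minimal sketch of one way the pieces could fit together. It is not the linked answer's code. It assumes the documents are `.txt` files in a single directory, and the function names `loadCollection`, `buildIndex`, and `tfidf` are hypothetical helpers introduced here for illustration. The indexing logic is the same as in the question, except that it splits on any whitespace instead of a single space, because files usually contain newlines.

```php
<?php
// Hypothetical helper: read every .txt file in $dir into a docID => text array.
function loadCollection(string $dir): array
{
    $collection = array();
    // glob() returns the matching paths, or false on error.
    $paths = glob(rtrim($dir, '/') . '/*.txt');
    if ($paths === false) {
        return $collection;
    }
    foreach ($paths as $path) {
        $text = file_get_contents($path);
        if ($text !== false) {
            // Assumption: the file name serves as the document ID.
            $collection[basename($path)] = $text;
        }
    }
    return $collection;
}

// Build the same docCount/dictionary structure as in the question.
function buildIndex(array $collection): array
{
    $dictionary = array();
    $docCount = array();
    foreach ($collection as $docID => $doc) {
        // Split on runs of whitespace and drop empty pieces.
        $terms = preg_split('/\s+/', $doc, -1, PREG_SPLIT_NO_EMPTY);
        $docCount[$docID] = count($terms);
        foreach ($terms as $term) {
            if (!isset($dictionary[$term])) {
                $dictionary[$term] = array('df' => 0, 'postings' => array());
            }
            if (!isset($dictionary[$term]['postings'][$docID])) {
                $dictionary[$term]['df']++;
                $dictionary[$term]['postings'][$docID] = array('tf' => 0);
            }
            $dictionary[$term]['postings'][$docID]['tf']++;
        }
    }
    return array('docCount' => $docCount, 'dictionary' => $dictionary);
}

// Return docID => tf-idf for $term, using tf * log2(N / df) as in the question.
function tfidf(array $index, string $term): array
{
    $scores = array();
    if (!isset($index['dictionary'][$term])) {
        return $scores; // term does not occur in any document
    }
    $n = count($index['docCount']);
    $entry = $index['dictionary'][$term];
    foreach ($entry['postings'] as $docID => $posting) {
        $scores[$docID] = $posting['tf'] * log($n / $entry['df'], 2);
    }
    return $scores;
}
```

It could then be used as `$index = buildIndex(loadCollection('/path/to/docs'));` followed by a loop over `tfidf($index, 'short')`, where the path and the term are placeholders. Keeping the file reading separate from the indexing means the question's original array can still be passed straight to `buildIndex`.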