Question

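I have successfully computed tf-idf from an array. Now I would like tf-idf to be computed from multiple text files, since I have several text files in my directory. Can anyone modify this code for multiple text files, so that it first reads all the files in the directory and then computes tf-idf based on their contents? My code is below. Thanks.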

$collection = array(
    1 => 'this string is a short string but a good string',
    2 => 'this one isn\'t quite like the rest but is here',
    3 => 'this is a different short string that\'s not as short'
);

$dictionary = array();
$docCount = array();

foreach($collection as $docID => $doc) {
    $terms = explode(' ', $doc);
    $docCount[$docID] = count($terms);

    foreach($terms as $term) {
        if(!isset($dictionary[$term])) {
            $dictionary[$term] = array('df' => 0, 'postings' => array());
        }
        if(!isset($dictionary[$term]['postings'][$docID])) {
            $dictionary[$term]['df']++;
            $dictionary[$term]['postings'][$docID] = array('tf' => 0);
        }

        $dictionary[$term]['postings'][$docID]['tf']++;
    }
}

$temp = array('docCount' => $docCount, 'dictionary' => $dictionary);

Calculating tf-idf

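$index = $temp;
$docCount = count($index['docCount']); // number of documents in the collection
$term = 'short'; // term to score; any key of $index['dictionary'] works
$entry = $index['dictionary'][$term];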
foreach($entry['postings'] as  $docID => $postings) {
    echo "Document $docID and term $term give TFIDF: " .
        ($postings['tf'] * log($docCount / $entry['df'], 2));
    echo "\n";
}

Answer 1

Take a look at this answer: Reading all file contents from a directory - php

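There you can find information on how to read the contents of all files in a directory. With that information, you should be able to modify your code so that it works as expected.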
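As a rough illustration of that approach, here is a minimal sketch that reads every text file in a directory into the `$collection` array and then reuses the indexing and scoring logic from the question. The function names `loadCollection`, `buildIndex` and `tfidf`, the `*.txt` pattern, and the `docs` directory in the usage comment are all assumptions made for this example, not part of the original code. It also splits on any whitespace rather than a single space, on the assumption that files contain line breaks.

```php
<?php
// Hypothetical helper: read every .txt file in $dir into an array keyed
// by file name. The "*.txt" pattern is an assumption; adjust as needed.
function loadCollection(string $dir): array
{
    $collection = array();
    $paths = glob(rtrim($dir, '/') . '/*.txt');
    if ($paths === false) {
        return $collection; // glob() failed, treat as an empty collection
    }
    foreach ($paths as $path) {
        $text = file_get_contents($path);
        if ($text !== false) {
            $collection[basename($path)] = $text;
        }
    }
    return $collection;
}

// Same indexing logic as in the question, wrapped in a function.
function buildIndex(array $collection): array
{
    $dictionary = array();
    $docCount = array();
    foreach ($collection as $docID => $doc) {
        // Split on any run of whitespace so newlines do not glue words together.
        $terms = preg_split('/\s+/', $doc, -1, PREG_SPLIT_NO_EMPTY);
        $docCount[$docID] = count($terms);
        foreach ($terms as $term) {
            if (!isset($dictionary[$term])) {
                $dictionary[$term] = array('df' => 0, 'postings' => array());
            }
            if (!isset($dictionary[$term]['postings'][$docID])) {
                $dictionary[$term]['df']++;
                $dictionary[$term]['postings'][$docID] = array('tf' => 0);
            }
            $dictionary[$term]['postings'][$docID]['tf']++;
        }
    }
    return array('docCount' => $docCount, 'dictionary' => $dictionary);
}

// Score one term in every document that contains it: tf * log2(N / df),
// the same weighting as the loop in the question.
function tfidf(array $index, string $term): array
{
    $scores = array();
    if (!isset($index['dictionary'][$term])) {
        return $scores; // term does not occur in any document
    }
    $n = count($index['docCount']);
    $entry = $index['dictionary'][$term];
    foreach ($entry['postings'] as $docID => $posting) {
        $scores[$docID] = $posting['tf'] * log($n / $entry['df'], 2);
    }
    return $scores;
}

// Usage (assumes a "docs" directory next to this script):
// $index = buildIndex(loadCollection(__DIR__ . '/docs'));
// foreach (tfidf($index, 'string') as $docID => $score) {
//     echo "Document $docID and term string give TFIDF: $score\n";
// }
```

Keying the collection by file name instead of a numeric ID keeps the output readable; the rest of the code does not depend on what the document IDs look like.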

How to calculate tf-idf from multiple text files in PHP?
