Computing the inverse document frequency (IDF) for each document

Time: 2012-08-08 01:52:17

Tags: php

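I want to compute IDF using the formula IDF = log(D / df), where D is the total number of data items and df is the number of data items that contain the search term. The data comes from this table: 1. tb_stemming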

 ===========================================================================
 |stem_id | stem_before | stem_after | stem_freq | sentence_id |document_id|
 ===========================================================================
 |    1   |    Data     |    Data    |     1      |      0     |     1     |
 |    2   |   Discuss   |   Discuss  |     1      |      1     |     1     |
 |    3   |   Mining    |    Min     |     1      |      0     |     2     |
 ===========================================================================
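To make the formula concrete, here is a minimal sketch that applies IDF = log(D / df) to the three sample rows above, entirely in memory. It assumes the "data item" is a document (D is the number of distinct document_id values, df the number of distinct documents containing a stem); `computeIdf` is a hypothetical helper name, not something from the original code.

```php
<?php
// Rows copied from the tb_stemming sample above (only the columns needed here).
$rows = [
    ['stem_after' => 'Data',    'document_id' => 1],
    ['stem_after' => 'Discuss', 'document_id' => 1],
    ['stem_after' => 'Min',     'document_id' => 2],
];

/**
 * Hypothetical helper: returns [stem => idf] with idf = log(D / df).
 * Assumption: D = number of distinct documents,
 *             df = number of distinct documents containing the stem.
 */
function computeIdf(array $rows): array
{
    $allDocs = [];      // set of document ids seen
    $docsPerStem = [];  // stem => set of document ids containing it
    foreach ($rows as $row) {
        $allDocs[$row['document_id']] = true;
        $docsPerStem[$row['stem_after']][$row['document_id']] = true;
    }

    $totalDocs = count($allDocs);
    $idf = [];
    foreach ($docsPerStem as $stem => $docs) {
        // log() in PHP is the natural logarithm.
        $idf[$stem] = log($totalDocs / count($docs));
    }
    return $idf;
}

$idf = computeIdf($rows);
foreach ($idf as $stem => $value) {
    printf("%s: %.4f\n", $stem, $value);
}
```

With these rows, each stem occurs in one of two documents, so every value is log(2 / 1), roughly 0.6931. Swapping document_id for sentence_id would give a sentence-level variant of the same calculation.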

Here is the code:

countIDF($total_sentence,$doc_id);

$total_sentence contains: Array ( [0] => 644 [1] => 79 [2] => 264 [3] => 441 [4] => 502 [5] => 18 [6] => 352 [7] => 219 [8] => 219 )

function countIDF($total_sentence, $doc_id) {
    foreach ($total_sentence as $doc_id => $total_sentences){
       $idf = 0;
       $query1 = mysql_query("SELECT document_id, DISTINCT(stem_after) AS unique_token FROM tb_stemming group by stem_after where document_id='$doc_id'  ' ");
       while ($row = mysql_fetch_array($query)) {
           $token  = $row['unique_token'];
           $doc_id = $row['document_id'];
           $ndw    = countNDW($token);

           $idf = log($total_sentences / $ndw)+1;
           $q   = mysql_query("INSERT INTO tb_idf VALUES ('','$doc_id','$token','$ndw','$idf') ");
        }
     }
}

And the countNDW function is:

function countNDW ($word) {
    $query = mysql_query("SELECT stem_after, COUNT( DISTINCT sentence_id ) AS ndw FROM `tb_stemming` WHERE stem_after = '$word' GROUP BY stem_after");
    while ($row = mysql_fetch_array($query)) {
        $ndw = $row['ndw'];
    }
    return $ndw;
}

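It does not work properly, especially the database calls. All I need is for the count to be done within each document_id. How do I define that in my code? Please help me. Thank you very much :).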
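One possible way to restructure the code so the count is scoped to each document_id is sketched below. It is a sketch under several assumptions, not a drop-in fix: it uses PDO with prepared statements instead of the mysql_* functions, it runs against an in-memory SQLite database purely so the example is self-contained, the tb_idf column names are hypothetical (the original INSERT does not name them), and the $totalSentences array is assumed to be keyed by document_id. The SQL puts WHERE before GROUP BY and folds the countNDW lookup into the same query.

```php
<?php
// Demo setup only: in-memory SQLite stands in for the real MySQL database.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE tb_stemming (stem_id INTEGER, stem_before TEXT,
            stem_after TEXT, stem_freq INTEGER, sentence_id INTEGER, document_id INTEGER)');
$pdo->exec("INSERT INTO tb_stemming VALUES
            (1, 'Data', 'Data', 1, 0, 1),
            (2, 'Discuss', 'Discuss', 1, 1, 1),
            (3, 'Mining', 'Min', 1, 0, 2)");
// Hypothetical column names for tb_idf.
$pdo->exec('CREATE TABLE tb_idf (document_id INTEGER, token TEXT, ndw INTEGER, idf REAL)');

/**
 * For each document, counts the distinct sentences that contain each stem
 * (ndw) within that document only, then stores log(total / ndw) + 1.
 * Assumption: $totalSentences maps document_id => sentence count.
 */
function countIDF(PDO $pdo, array $totalSentences): void
{
    $select = $pdo->prepare(
        'SELECT stem_after AS token, COUNT(DISTINCT sentence_id) AS ndw
           FROM tb_stemming
          WHERE document_id = :doc
          GROUP BY stem_after'
    );
    $insert = $pdo->prepare(
        'INSERT INTO tb_idf (document_id, token, ndw, idf)
         VALUES (:doc, :token, :ndw, :idf)'
    );

    foreach ($totalSentences as $docId => $total) {
        $select->execute([':doc' => $docId]);
        foreach ($select->fetchAll(PDO::FETCH_ASSOC) as $row) {
            $ndw = (int) $row['ndw'];
            $idf = log($total / $ndw) + 1;
            $insert->execute([
                ':doc'   => $docId,
                ':token' => $row['token'],
                ':ndw'   => $ndw,
                ':idf'   => $idf,
            ]);
        }
    }
}

// Example call: document 1 has 2 sentences, document 2 has 1 (from the sample rows).
countIDF($pdo, [1 => 2, 2 => 1]);
```

Against a real MySQL server, only the PDO connection string would need to change, assuming the tb_idf columns are named as above. Binding parameters also avoids interpolating $doc_id and $word directly into the SQL strings.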

0 answers:

No answers