Dividing the total number of documents by the number of documents that contain a stem

Time: 2012-09-12 10:27:15

Tags: php mysql sql

I have two tables:

tb_sentence

================================
|id|doc_id|sentence_id|sentence|
================================
| 1|  1   |   0       |    AB  |
| 2|  1   |   1       |    CD  |
| 3|  2   |   0       |    EF  |
| 4|  2   |   1       |    GH  |
| 5|  2   |   2       |    IJ  |
| 6|  2   |   3       |    KL  |
================================

First, I count the number of sentences in each doc_id and store the counts in the variable $total_sentence, so its value is Array ( [0] => 2 [1] => 4 ).

The second table is tb_stem:

============================
|id|stem|doc_id|sentence_id|
============================
|1 | B  |  1   |     0     |
|2 | A  |  1   |     1     |
|3 | C  |  2   |     0     |
|4 | A  |  2   |     1     |
|5 | E  |  2   |     2     |
|6 | C  |  2   |     3     |
|7 | D  |  2   |     4     |
|8 | G  |  2   |     5     |
|9 | A  |  2   |     6     |
============================

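Second, I need to group the stems in tb_stem by doc_id and then, for each $token from that result, count the number of sentences (sentence_id) that contain it. The idea is to divide the total number of documents by the number of documents that contain the stem. The code: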

$query1 = mysql_query("SELECT DISTINCT(stem) AS unique FROM `tb_stem` group by stem,doc_id ");
while ($row = mysql_fetch_array($query1)) {
    $token = $row['unique']; //the result $token must be : ABACDEG
}

$query2 = mysql_query("SELECT stem, COUNT( DISTINCT sentence_id ) AS ndw FROM `tb_stem` WHERE stem = '$token' GROUP BY stem, doc_id");
    while ($row = mysql_fetch_array($query2)) {
        $ndw = $row['ndw']; //the result must be : 1122111
}

$idf = log($total_sentence / $ndw)+1; //$total_sentence for doc_id = 1 must be divide $ndw with the doc_id = 2, etc

But the result is not separated by document, as the following table shows:

============================
|id|word|doc_id|  ndw |idf |
============================
|1 | A  |      |      |    |
|2 | B  |      |      |    |
|3 | C  |      |      |    |
|4 | D  |      |      |    |
|5 | E  |      |      |    |
|6 | G  |      |      |    |
============================

The result should be:

============================
|id|word|doc_id|  ndw |idf |
============================
|1 | A  |   1  |      |    |
|2 | B  |   1  |      |    |
|3 | A  |   2  |      |    |
|4 | C  |   2  |      |    |
|5 | D  |   2  |      |    |
|6 | E  |   2  |      |    |
|7 | G  |   2  |      |    |
============================

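Please help me, thank you :)

The formula for idf is idf = log(N/df), where N is the total number of documents and df is the number of documents in which the term (t) appears. Each sentence is treated as one document. Here is an example of the idf calculation. Document: Do you read poetry while flying. Many people find it relaxing to read on long flights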

=================================================
|     Term     | Document1(D1)| D2| df |   idf  |
=================================================
|     find     |     0        | 1 |  1 |log(2/1)|
|     fly      |     1        | 1 |  2 |log(2/2)|
|     long     |     0        | 1 |  1 |log(2/1)|
|    people    |     0        | 1 |  1 |log(2/1)|
|    poetry    |     1        | 0 |  1 |log(2/1)|
|     read     |     1        | 1 |  2 |log(2/2)|
|    relax     |     0        | 1 |  1 |log(2/1)|
=================================================
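
The table above can be reproduced with a short sketch. It is written in Python rather than PHP only to keep it self-contained, and it makes two assumptions: the stemmed terms for each sentence are copied by hand from the table (no real stemmer is run), and log is the natural logarithm, as with PHP's log(). The names docs and idf_table are illustrative and do not come from the question.

```python
import math

# Stemmed terms per document, copied by hand from the table above.
# Each sentence is treated as one document (assumption from the question).
docs = {
    "D1": {"fly", "poetry", "read"},
    "D2": {"find", "fly", "long", "people", "read", "relax"},
}


def idf_table(docs):
    """Return {term: (df, idf)} where idf = log(N / df), natural log."""
    n = len(docs)  # N: total number of documents
    terms = set().union(*docs.values())
    table = {}
    for term in sorted(terms):
        # df: number of documents whose term set contains this term
        df = sum(1 for words in docs.values() if term in words)
        table[term] = (df, math.log(n / df))
    return table


for term, (df, idf) in idf_table(docs).items():
    print(f"{term:<8} df={df} idf={idf:.3f}")
```

Terms that occur in both sentences (fly, read) get idf = log(2/2) = 0, and every other term gets log(2/1).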

1 answer:

Answer 0 (score: 2)

This query will give you the table you want:

SELECT t1.doc_id, t2.token as word, t2.token_freq as df, 
log(t1.docs/t2.token_freq) as idf
FROM 
(SELECT doc_id,count(sentence_id) as docs from tb_sentence group by doc_id) as t1,
(SELECT DISTINCT(stem) as token, doc_id, COUNT(sentence_id) as token_freq 
      FROM tb_stem GROUP BY doc_id, token) as t2
WHERE t1.doc_id = t2.doc_id

Note: unique, which your original query uses as a column alias, is a reserved word in MySQL and will give you an error.
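
As a rough check of the per-document grouping, the sketch below loads the sample rows from the question into an in-memory SQLite database and runs a query of the same shape. This is an illustration under assumptions, not the answer's exact query: it uses Python with SQLite instead of PHP with MySQL, it counts COUNT(DISTINCT sentence_id) as the question's own query does, it omits the id column, and it computes the logarithm in Python because SQLite's log() is not available in every build. The names SENTENCES, STEMS, QUERY and per_document_idf are hypothetical.

```python
import math
import sqlite3

# Sample rows copied from the question's tables (id column omitted).
SENTENCES = [(1, 0, "AB"), (1, 1, "CD"), (2, 0, "EF"),
             (2, 1, "GH"), (2, 2, "IJ"), (2, 3, "KL")]
STEMS = [("B", 1, 0), ("A", 1, 1), ("C", 2, 0), ("A", 2, 1), ("E", 2, 2),
         ("C", 2, 3), ("D", 2, 4), ("G", 2, 5), ("A", 2, 6)]

# Sentences per document joined with per-document stem counts.
QUERY = """
SELECT s.doc_id, t.stem, t.ndw, s.total
FROM (SELECT doc_id, COUNT(*) AS total
      FROM tb_sentence GROUP BY doc_id) AS s
JOIN (SELECT doc_id, stem, COUNT(DISTINCT sentence_id) AS ndw
      FROM tb_stem GROUP BY doc_id, stem) AS t
  ON s.doc_id = t.doc_id
ORDER BY s.doc_id, t.stem
"""


def per_document_idf():
    """Return a list of (doc_id, stem, ndw, idf) rows for the sample data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tb_sentence "
                 "(doc_id INTEGER, sentence_id INTEGER, sentence TEXT)")
    conn.execute("CREATE TABLE tb_stem "
                 "(stem TEXT, doc_id INTEGER, sentence_id INTEGER)")
    conn.executemany("INSERT INTO tb_sentence VALUES (?, ?, ?)", SENTENCES)
    conn.executemany("INSERT INTO tb_stem VALUES (?, ?, ?)", STEMS)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    # idf = log(total sentences in the document / sentences containing the stem)
    return [(doc_id, stem, ndw, math.log(total / ndw))
            for doc_id, stem, ndw, total in rows]


for doc_id, stem, ndw, idf in per_document_idf():
    print(f"{stem} doc_id={doc_id} ndw={ndw} idf={idf:.3f}")
```

On the sample data this yields one row per stem and document, in the order A, B for doc_id 1 and A, C, D, E, G for doc_id 2, which is the layout the question asks for.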