I have two tables. The first is tb_sentence:
================================
|id|doc_id|sentence_id|sentence|
================================
| 1| 1 | 0 | AB |
| 2| 1 | 1 | CD |
| 3| 2 | 0 | EF |
| 4| 2 | 1 | GH |
| 5| 2 | 2 | IJ |
| 6| 2 | 3 | KL |
================================
First, I count the number of sentences for each doc_id and store the counts in the variable $total_sentence.
So the value of $total_sentence is Array ( [0] => 2 [1] => 4 ).
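As a rough sketch of that counting step, assuming the tb_sentence rows have already been fetched into a plain PHP array so the example runs without a database. The helper name count_sentences_per_doc is hypothetical and only serves this illustration.

```php
<?php
// Rows of tb_sentence as (doc_id, sentence_id) pairs, copied from the table above.
$rows = [[1, 0], [1, 1], [2, 0], [2, 1], [2, 2], [2, 3]];

// Hypothetical helper: count how many sentences each doc_id has.
function count_sentences_per_doc(array $rows): array
{
    $counts = [];
    foreach ($rows as [$docId, $sentenceId]) {
        $counts[$docId] = ($counts[$docId] ?? 0) + 1;
    }
    ksort($counts); // keep documents in doc_id order
    return $counts;
}

// Re-index from 0 so the shape matches the $total_sentence array shown above.
$total_sentence = array_values(count_sentences_per_doc($rows));
print_r($total_sentence);
```

Keeping the counts keyed by doc_id (skipping the array_values call) may be more convenient later, since the idf step needs to look up the count for a specific document.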
The second table is tb_stem:
============================
|id|stem|doc_id|sentence_id|
============================
|1 | B | 1 | 0 |
|2 | A | 1 | 1 |
|3 | C | 2 | 0 |
|4 | A | 2 | 1 |
|5 | E | 2 | 2 |
|6 | C | 2 | 3 |
|7 | D | 2 | 4 |
|8 | G | 2 | 5 |
|9 | A | 2 | 6 |
============================
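Second, I need to group the data in tb_stem by doc_id, and then, for each $token from the previous result, count the number of sentence_id values that contain it. The idea is to divide the total number of documents by the number of documents that contain the stem.
Code: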
$query1 = mysql_query("SELECT DISTINCT(stem) AS unique FROM `tb_stem` group by stem,doc_id ");
while ($row = mysql_fetch_array($query1)) {
    $token = $row['unique']; // expected $token values: A B A C D E G
}
$query2 = mysql_query("SELECT stem, COUNT( DISTINCT sentence_id ) AS ndw FROM `tb_stem` WHERE stem = '$token' GROUP BY stem, doc_id");
while ($row = mysql_fetch_array($query2)) {
    $ndw = $row['ndw']; // expected $ndw values: 1 1 2 2 1 1 1
}
$idf = log($total_sentence / $ndw) + 1; // $total_sentence for doc_id = 1 must be divided by $ndw, likewise for doc_id = 2, etc.
But the result is not separated by document, as the table below shows:
============================
|id|word|doc_id| ndw |idf |
============================
|1 | A | | | |
|2 | B | | | |
|3 | C | | | |
|4 | D | | | |
|5 | E | | | |
|6 | G | | | |
============================
The result should be:
============================
|id|word|doc_id| ndw |idf |
============================
|1 | A | 1 | | |
|2 | B | 1 | | |
|3 | A | 2 | | |
|4 | C | 2 | | |
|5 | D | 2 | | |
|6 | E | 2 | | |
|7 | G | 2 | | |
============================
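Please help me, thank you :)
The formula for idf is idf = log(N/df), where N is the number of documents and df is the number of documents in which the term (t) appears. Each sentence is treated as a document.
Here is an example of the idf calculation:
Document: Do you read poetry while flying. Many people find it relaxing to read on long flights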
=================================================
| Term | Document1(D1)| D2| df | idf |
=================================================
| find | 0 | 1 | 1 |log(2/1)|
| fly | 1 | 1 | 2 |log(2/2)|
| long | 0 | 1 | 1 |log(2/1)|
| people | 0 | 1 | 1 |log(2/1)|
| poetry | 1 | 0 | 1 |log(2/1)|
| read | 1 | 1 | 2 |log(2/2)|
| relax | 0 | 1 | 1 |log(2/1)|
=================================================
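The table above can be reproduced with a short PHP sketch. It assumes the sentences have already been stemmed into the term lists shown (stemming itself is not covered here), and the helper name compute_idf is hypothetical. PHP's log() is the natural logarithm, so log(2/1) prints as roughly 0.6931 and log(2/2) as 0.

```php
<?php
// Stemmed terms per sentence, taken from the table above.
// Each sentence counts as one document.
$docs = [
    'D1' => ['read', 'poetry', 'fly'],
    'D2' => ['people', 'find', 'relax', 'read', 'long', 'fly'],
];

// Hypothetical helper: idf = log(N / df) for every term.
function compute_idf(array $docs): array
{
    $n = count($docs); // N: number of documents
    $df = [];          // df: number of documents containing each term
    foreach ($docs as $terms) {
        // array_unique so a term repeated within one document counts once
        foreach (array_unique($terms) as $term) {
            $df[$term] = ($df[$term] ?? 0) + 1;
        }
    }
    $idf = [];
    foreach ($df as $term => $count) {
        $idf[$term] = log($n / $count);
    }
    ksort($idf); // alphabetical, like the table
    return $idf;
}

$idf = compute_idf($docs);
foreach ($idf as $term => $value) {
    printf("%-7s %.4f\n", $term, $value);
}
```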
Answer 0 (score: 2)
This query will give you the table you need:
SELECT t2.doc_id, t2.token AS word, t2.token_freq AS df,
       LOG(t1.docs / t2.token_freq) AS idf
FROM
  -- number of sentences per document
  (SELECT doc_id, COUNT(sentence_id) AS docs
   FROM tb_sentence GROUP BY doc_id) AS t1
JOIN
  -- number of distinct sentences containing each stem, per document
  (SELECT stem AS token, doc_id, COUNT(DISTINCT sentence_id) AS token_freq
   FROM tb_stem GROUP BY doc_id, stem) AS t2
  ON t1.doc_id = t2.doc_id
ORDER BY t2.doc_id, t2.token;
Note: unique, which you use as a column alias in your original query, is a reserved word in MySQL and will give you an error.
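If you want to sanity-check the per-document grouping without a database, the same computation can be sketched in plain PHP on the sample rows from the question. This is only an illustration under stated assumptions: the helper name idf_per_document is hypothetical, the sentence counts are keyed by doc_id, and log() is the natural logarithm, matching MySQL's single-argument LOG().

```php
<?php
// tb_stem rows as (stem, doc_id, sentence_id), copied from the question.
$stems = [
    ['B', 1, 0], ['A', 1, 1], ['C', 2, 0], ['A', 2, 1], ['E', 2, 2],
    ['C', 2, 3], ['D', 2, 4], ['G', 2, 5], ['A', 2, 6],
];
// Sentences per doc_id, as counted from tb_sentence.
$sentencesPerDoc = [1 => 2, 2 => 4];

// Hypothetical helper: one result row per (doc_id, stem).
function idf_per_document(array $stems, array $sentencesPerDoc): array
{
    $seen = []; // $seen[doc_id][stem][sentence_id] = true (distinct sentences)
    foreach ($stems as [$stem, $docId, $sentenceId]) {
        $seen[$docId][$stem][$sentenceId] = true;
    }
    ksort($seen);
    $rows = [];
    foreach ($seen as $docId => $byStem) {
        ksort($byStem);
        foreach ($byStem as $stem => $sentences) {
            $df = count($sentences);
            $rows[] = [
                'doc_id' => $docId,
                'word'   => $stem,
                'df'     => $df,
                'idf'    => log($sentencesPerDoc[$docId] / $df),
            ];
        }
    }
    return $rows;
}

$result = idf_per_document($stems, $sentencesPerDoc);
foreach ($result as $r) {
    printf("%d %s %d %.4f\n", $r['doc_id'], $r['word'], $r['df'], $r['idf']);
}
```

The rows come out grouped by document (A and B for doc_id 1, then A, C, D, E, G for doc_id 2), which is the separation the question asks for.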