Stop-word filtering for certain documents

Time: 2012-08-05 03:34:00

Tags: php mysql database

I am confused about how to filter the words of certain documents. I have to check the documents one by one. For example, from tb_tokens:

======================================================================
| tokens_id | tokens_word | tokens_freq|  sentence_id |  document_id |
======================================================================
|     1     |      A      |      1     |       0      |       1      |
|     2     |      B      |      1     |       0      |       1      |
|     3     |      C      |      1     |       1      |       1      |
|     4     |      D      |      1     |       0      |       2      |
|    ...    |             |            |              |              |
======================================================================

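First, I have to remove every word that appears in a list of common words such as "and" and "the". That list is stored in the table tb_stopword. Then I have to remove the words that occur in most of the documents; that list is stored in the table tb_term.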
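
The two filtering steps described above can be sketched in plain PHP. This is only a minimal sketch under stated assumptions: `filterTokens` is a hypothetical helper, and `$stopWords` and `$commonTerms` are assumed to be arrays already loaded from tb_stopword and tb_term. Matching is case-insensitive here, which is an assumption about the desired behavior.

```php
<?php
// Hypothetical helper: $stopWords and $commonTerms stand in for the rows
// of tb_stopword and tb_term, loaded into plain PHP arrays beforehand.
function filterTokens(array $tokens, array $stopWords, array $commonTerms): array
{
    // Merge both lists, lowercase them, and flip them into keys so that
    // each lookup is a hash check instead of one query per word.
    $blocked = array_flip(array_map('strtolower', array_merge($stopWords, $commonTerms)));

    $kept = [];
    foreach ($tokens as $token) {
        // Keep a token only if it is in neither list.
        if (!isset($blocked[strtolower($token)])) {
            $kept[] = $token;
        }
    }
    return $kept;
}

// Example: "and" and "the" are stop words, "C" is a too-common term.
$kept = filterTokens(['A', 'and', 'B', 'the', 'C'], ['and', 'the'], ['C']);
// $kept now holds ['A', 'B']
```

Loading both lists once and checking them in memory avoids running two queries for every token, which may matter when there are many documents.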

The function cekStopWord:

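function cekStopWord ($word) {
   // Look the word up in the stop list.
   $query = mysql_query("SELECT stoplist_word FROM tb_stopword where stoplist_word = '$word' ");
   // mysql_fetch_row() returns an array for a match, or false when no row is found.
   $row = mysql_fetch_row($query);
   if($row !== false) {
        return true;
   } else {
        return false;
   }
}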

A similar function for the second step (removing words that occur in most of the documents):

function cekTerm ($word) {
    $query = mysql_query("SELECT term_word FROM tb_term where term_word = '$word' ");

I am confused about how to process each document. I tried to do it by doc_id, but it doesn't work. Here is my code:

// $doc_id is a variable that holds an array of document_id values
$query = mysql_query('SELECT tokens_word, sentence_id, document_id FROM tb_tokens WHERE document_id IN (' . implode(",", $doc_id) . ')') or die(mysql_error());
while ($row = mysql_fetch_array($query)) {
    $word[$row['document_id']][$row['sentence_id']] = $row['tokens_word'];
}
foreach ($word as $doc_id => $words){
    $cekStopWord = cekStopWord($words);
    $cekTerm     = cekTerm($words);
    if((preg_match("/^[A-Z, 0-9]/", $words))&& (!$cekStopWord) && (!$cekTerm) ){
          $q = mysql_query("INSERT INTO tb_tagging VALUES ('','$words','','$sentence_id','$doc_id') ");

Also, how do I use preg_match on an array? Thank you very much :))
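
One possible reading of the problem: in the loop above, `$words` is an array of words keyed by sentence_id, while `preg_match()` expects a single string as its subject. PHP's `preg_grep()` applies a pattern to every element of an array instead. The sketch below assumes hard-coded sample rows in place of the database result, and collects matches in a hypothetical `$tagged` array rather than inserting into tb_tagging. The pattern is the one from the question.

```php
<?php
// Sample rows shaped like the result of the SELECT on tb_tokens (assumed data).
$rows = [
    ['tokens_word' => 'A',   'sentence_id' => 0, 'document_id' => 1],
    ['tokens_word' => 'and', 'sentence_id' => 0, 'document_id' => 1],
    ['tokens_word' => 'C',   'sentence_id' => 1, 'document_id' => 1],
    ['tokens_word' => 'D',   'sentence_id' => 0, 'document_id' => 2],
];

// Append with [] so words sharing a sentence_id do not overwrite each other.
$words = [];
foreach ($rows as $row) {
    $words[$row['document_id']][$row['sentence_id']][] = $row['tokens_word'];
}

$tagged = [];
foreach ($words as $docId => $sentences) {
    foreach ($sentences as $sentenceId => $sentenceWords) {
        // preg_grep() returns the elements of the array that match the pattern.
        $matches = preg_grep('/^[A-Z, 0-9]/', $sentenceWords);
        foreach ($matches as $w) {
            // Each entry is [document_id, sentence_id, word].
            $tagged[] = [$docId, $sentenceId, $w];
        }
    }
}
```

The stop-word and term checks would then go inside the innermost loop, where `$w` is a single word.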

0 answers:

No answers