我很困惑如何过滤某些文件的文字。我必须逐一检查文件。例如来自tb_tokens:
======================================================================
| tokens_id | tokens_word | tokens_freq| sentence_id | document_id |
======================================================================
| 1 | A | 1 | 0 | 1 |
| 2 | B | 1 | 0 | 1 |
| 3 | C | 1 | 1 | 1 |
| 4 | D | 1 | 0 | 2 |
| ... | | | | |
======================================================================
我必须删除列表中出现的所有单词,如“and”,“the”等常用单词。列表记录在表tb_stopword中,然后删除大多数出现在大多数文档中的单词记录在tb_term表中的列表。
函数cekStopWord:
function cekStopWord ($word) {
$query = mysql_query("SELECT stoplist_word FROM tb_stopword where stoplist_word = '$word' ");
$row = mysql_fetch_row($query);
if($row > 0) {
return true;
} else {
return false;
}
}
第二个过程的类似功能(删除大多数文档中大量出现的单词)
function cekTerm ($word) {
$query = mysql_query("SELECT term_word FROM tb_term where term_word = '$word' ");
我很困惑如何处理每个文件。我试图通过doc_id调用,但它不起作用。这是我的代码:
//$doc_id is a variable that save array of document_id
$query = mysql_query('SELECT tokens_word, sentence_id, document_id FROM tb_tokens WHERE document_id IN (' . implode(",", $doc_id) . ')') or die(mysql_error());
while ($row = mysql_fetch_array($query)) {
$word[$row['document_id']][$row['sentence_id']] = $row['tokens_word'];
}
foreach ($word as $doc_id => $words){
$cekStopWord = cekStopWord($words);
$cekTerm = cekTerm($words);
if((preg_match("/^[A-Z, 0-9]/", $words))&& (!$cekStopWord) && (!$cekTerm) ){
$q = mysql_query("INSERT INTO tb_tagging VALUES ('','$words','','$sentence_id','$doc_id') ");
还有如何在数组中使用preg_match? 非常感谢你:))