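PHP - Slow text preprocessing for text mining

Time: 2017-04-03 09:08:30

Tags: php arrays regex string text-mining

I am doing text preprocessing for text mining on a large database. I want to turn the articles from every row of the database into an array, but this takes a very long time to process.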

$multiMem   = memory_get_usage();
$xstart = microtime(TRUE);
$word = "";
$sql = mysql_query("SELECT * FROM tbl_content");
while($data = mysql_fetch_assoc($sql)){
  $word = $word."".$data['article'];
}

$preprocess = new preprocess($word);
$word= $preprocess->preprocess($word);
print_r($word);

$xfinish = microtime(TRUE);
  

This is my preprocess class:

class preprocess {

  var $teks;

  function preprocess($teks){
  /*start process segmentation*/
  $teks = trim($teks);

  //remove punctuation
  $teks = str_replace("'", "", $teks);
  $teks = str_replace("-", "", $teks);
  $teks = str_replace(")", "", $teks);
  $teks = str_replace("(", "", $teks);
  $teks = str_replace("=", "", $teks);
  $teks = str_replace(".", "", $teks);
  $teks = str_replace(",", "", $teks);
  $teks = str_replace(":", "", $teks);
  $teks = str_replace(";", "", $teks);
  $teks = str_replace("!", "", $teks);
  $teks = str_replace("?", "", $teks);

  //remove HTML tags
  $teks = strip_tags($teks);
  $teks = preg_replace('@<(\w+)\b.*?>.*?</\1>@si', '', $teks);
  /*end process segmentation*/

  /*start case folding*/
  $teks = strtolower($teks);

  $teks = preg_replace('/[0-9]+/', '', $teks);
  /*end case folding*/

  /*start of tokenizing*/
  $teks = explode(" ", $teks);

  /*end of tokenizing*/

  /*start of filtering*/
  //stopword
  $file = file_get_contents('stopword.txt', FILE_USE_INCLUDE_PATH);
  $stopword = explode("\n", $file);

  //remove stopword
  $teks = preg_replace('/\b('.implode('|',$stopword).')\b/','',$teks);

  /*end of filtering*/

  /*start of stemming*/
  require_once('stemming.php');
  foreach($teks as $t => $value){
    $teks[$t] = stemming($value);
  }
  /*end of stemming*/

  $teks = array_filter($teks);
  $teks = array_values($teks);

  return $teks;
 }
}

Does anyone have an idea how to make my program run faster? Please help.
Thanks in advance.

1 answer:

Answer 0: (score: 1)

Here are a few things that might improve it...

  1. After building $word, you can free the query result and unset $sql and $data

    $word = '';
    $sql = mysql_query("SELECT * FROM tbl_content");
    while($data = mysql_fetch_assoc($sql)){
      $word = $word . $data['article'];
    }
    mysql_free_result($sql);
    unset($sql, $data);
    
  2. This block:

    $teks = str_replace("'", "", $teks);
    $teks = str_replace("-", "", $teks);
    $teks = str_replace(")", "", $teks);
    $teks = str_replace("(", "", $teks);
    $teks = str_replace("=", "", $teks);
    $teks = str_replace(".", "", $teks);
    $teks = str_replace(",", "", $teks);
    $teks = str_replace(":", "", $teks);
    $teks = str_replace(";", "", $teks);
    $teks = str_replace("!", "", $teks);
    $teks = str_replace("?", "", $teks);
    
  3. can be written as a single call:

        $teks = str_replace(array("'",'-',')','(','=','.',',',':',';','!','?'), '', $teks);
    
    1. Since you later use a regular expression to remove digits, you can either add the digits to the str_replace call above or move the punctuation characters into the preg_replace:

      $teks = str_replace(array('0','1','2','3','4','5','6','7','8','9',"'",'-',')','(','=','.',',',':',';','!','?'), '', $teks);
      

      or

      $teks = preg_replace('/[0-9\'\-()=.,:;!?]+/', '', $teks);
      
    2. $teks = strip_tags($teks); should be enough. If it is not, use only the preg_replace that follows it, since running both duplicates the work.

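    3. Use file() instead of file_get_contents() followed by explode(), since file() returns an array directly. Also, there is no need to explode $teks first: preg_replace accepts either a string or an array as its subject.

         // FILE_IGNORE_NEW_LINES keeps the trailing newline out of each word
         $stopword = file('stopword.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
         array_walk($stopword, function(&$item1){
           // preg_quote makes sure the word is matched literally
           $item1 = '/\b' . preg_quote($item1, '/') . '\b/';
         });
         $teks = preg_replace($stopword, '', $teks);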
      
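    4. In general, prefer single quotes over double quotes for literal strings, because PHP scans double-quoted strings for variables and escape sequences to evaluate, which takes slightly longer; the difference is small.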

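    5. If the stopword.txt list does not change, it is better and faster to keep it directly in the code as an array instead of reading it from the file system.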
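
Putting the suggestions above together, a minimal sketch of a faster pipeline might look like the following. The function name fastPreprocess, its parameters and the callable stemmer argument are hypothetical and do not come from the question or the answer. The sketch also swaps in a different technique for the stopword step: instead of running one regular expression per stopword, each token is checked against a flipped array with isset(), which assumes stopwords only need to match whole tokens.

```php
<?php
// Sketch only: fastPreprocess() and its parameters are hypothetical names,
// not part of the original question or answer.

/**
 * @param string   $text      raw article text (may contain HTML)
 * @param string[] $stopwords stopword list, one word per element
 * @param callable $stem      stemmer, e.g. 'stemming' from stemming.php
 * @return string[] cleaned, stemmed tokens
 */
function fastPreprocess($text, array $stopwords, $stem)
{
    // Strip HTML tags, then fold case.
    $text = strtolower(strip_tags($text));

    // A single pass removes the digits and the punctuation listed in the question.
    $text = preg_replace('/[0-9\'\-()=.,:;!?]+/', '', $text);

    // Split on any run of whitespace; PREG_SPLIT_NO_EMPTY drops empty tokens.
    $tokens = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    // Flip the list so that each stopword becomes an array key:
    // isset() is then a hash lookup instead of a regex match.
    $stopSet = array_flip($stopwords);

    $result = array();
    foreach ($tokens as $token) {
        if (isset($stopSet[$token])) {
            continue; // skip stopwords
        }
        $stemmed = call_user_func($stem, $token);
        if ($stemmed !== '') {
            $result[] = $stemmed;
        }
    }
    return $result;
}
```

Assuming stopword.txt holds one word per line and stemming.php defines stemming() as in the question, it could be called as fastPreprocess($word, file('stopword.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES), 'stemming'). Separately, in PHP versions before 8.0 a method with the same name as its class is treated as the constructor, so new preprocess($word) in the question appears to run the whole pipeline once before $preprocess->preprocess($word) runs it a second time.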